Abstract

This paper proposes confidence regions for the identified set in conditional moment
inequality models using Kolmogorov-Smirnov (KS) statistics with a truncated inverse
variance weighting with increasing truncation points. The new weighting differs from
those proposed in the literature in two important ways. First, confidence regions based
on KS tests with the weighting function I propose converge to the identified set at
a faster rate than existing procedures based on bounded weight functions in a broad
class of models. This provides a theoretical justification for inverse variance weighting
in this context, and contrasts with analogous results for conditional moment equalities,
in which optimal weighting only affects the asymptotic variance. Second, the new
weighting changes the asymptotic behavior, including the rate of convergence, of the
KS statistic itself, requiring a new asymptotic theory for choosing the critical value,
which I provide. To make these comparisons, I derive rates of convergence for the
confidence regions I propose along with new results for rates of convergence of existing
estimators under a general set of conditions. A series of examples illustrates the
broad applicability of the conditions. A Monte Carlo study examines the finite sample
behavior of the confidence regions.

1 Introduction



This paper proposes methods for inference in conditional moment inequality models and
derives new relative efficiency results for these models to show that these methods are more
efficient than available methods in a certain precise sense. Formally, these models are defined
by a restriction of the form $E_P(m(W_i,\theta)|X_i) \ge 0$ almost surely. Here, $m$ is a known
parametric function, which may be vector valued (in which case the inequality is interpreted as
elementwise). This setup includes many models commonly used in econometrics, including 
regression models with endogenously censored or missing data, selection models, and certain 
models of firm and consumer behavior. 

The problem is to perform inference on the identified set 

$$\Theta_0(P) = \{\theta \mid E_P(m(W_i,\theta)|X_i) \ge 0 \text{ a.s.}\}$$

given a sample $(X_1, W_1), \ldots, (X_n, W_n)$ from $P$. This paper proposes confidence regions $\mathcal{C}_n$
that satisfy

$$\liminf_{n} \inf_{P \in \mathcal{P}} P(\Theta_0(P) \subseteq \mathcal{C}_n) \ge 1 - \alpha \qquad (1)$$

for classes of probability distributions $\mathcal{P}$ restricted only by mild regularity conditions. For
these confidence regions and several confidence regions available in the literature satisfying
this requirement, I derive rates of convergence of $\mathcal{C}_n$ to $\Theta_0(P)$. The results give sequences
$a_n$, which depend on the smoothness of $\mathcal{P}$ and the method used to construct $\mathcal{C}_n$, such that

$$\sup_{P \in \mathcal{P}} P(d_H(\Theta_0(P), \mathcal{C}_n) > a_n) \to 0. \qquad (2)$$

These results show that, in a general class of models, the confidence regions proposed here 
are the only ones to obtain the best rate in (2) for a variety of classes $\mathcal{P}$ defined by
different smoothness conditions without prior knowledge of $\mathcal{P}$. In this sense, the confidence
regions proposed here are adaptive. 

The confidence regions proposed in this paper are based on a Kolmogorov-Smirnov (KS)
statistic weighted by a truncation of the inverse of the sample variance with an increasing
sequence of truncation points. Following the approach of Chernozhukov, Hong, and Tamer
(2007) and Romano and Shaikh (2010), the confidence regions invert these tests using critical
values that control the familywise error rate over parameter values in the identified set,
resulting in a set that satisfies (1). The increasing sequence of truncation points I propose
changes the asymptotic behavior and, in particular, the rate of convergence of the KS statistic
relative to the bounded weightings proposed in the literature. This requires a new asymptotic 
theory in choosing the critical value, which I develop. I derive the rate of convergence to 
the identified set for these confidence regions under conditions that apply to a broad class of 
models while still being interpretable. Since general results for rates of convergence to the 
identified set have not been derived for confidence regions based on kernel methods or KS 
statistics with bounded weights, I derive rates of convergence for confidence regions based 
on these existing approaches as well. For the class of models I consider, I find that using the 
inverse variance with increasing truncation points as the weight function in the KS statistic
results in a confidence region for the identified set that has a faster rate of convergence 
to the identified set than the KS statistic based confidence regions with bounded weights 
proposed in the literature, and achieves the same rate of convergence as a kernel estimate 
with the optimal bandwidth. For classes of underlying distributions in which smoothness of
two derivatives or less is imposed, these rates correspond with the upper bounds derived by
Stone (1982) for estimating conditional means.



To my knowledge, these results provide the first theoretical justification for weighting 
moments by their variance in conditional moment inequality problems. If the truncation 
parameter is allowed to increase fast enough, weighting by the variance in the KS objective 
function increases the rate of convergence of confidence regions to the identified set under 
the conditions I consider. Given that numerous negative results exist for similar problems, it 
might be surprising that such general results on relative efficiency could be obtained. For one, 
the test procedures I compare are adapted from nonparametric goodness of fit tests. The
general consensus in this literature is that the relative efficiency of these tests will depend on
the particular situation, and that, while power results can be obtained for certain types of 
alternatives, one cannot make any broad conclusions about which tests are more powerful. 
An important insight of this paper is that, although one cannot make a general statement 
about one procedure being optimal against all possible alternatives in every setting, most 
conditional moment inequality models used in practice place restrictions on how parameter 
values not in the identified set translate to the conditional moment restriction being violated. 
One of the contributions of this paper is to propose a set of interpretable conditions under 
which the truncated variance weighting proposed in this paper is most efficient, and to show 
that several models used in practice satisfy them. 

A second reason that relative efficiency in this setting might seem like an intractable 
problem is that, even for the seemingly simpler problem of inference based on finitely many
moment inequalities, no relative efficiency results in terms of local power comparisons have
been developed. Indeed, the lack of such results has motivated interest in large deviations
optimality (Canay, 2010), which is of particular interest when local power comparisons do
not give a clear recommendation. This paper makes progress in a seemingly more difficult 
problem by showing that, while power comparisons in models with unconditional moment 
inequalities involve subtle issues of how relative efficiency should be defined for inference on 
sets, power comparisons for conditional moment inequalities can be made with the coarser 
comparison of rates of convergence to the identified set. Since different approaches to inference
on the identified set lead to different rates of convergence to the identified set, comparing
rates of convergence leads to clear recommendations of which estimator to use. 

Part of the intuition for the efficiency of the inverse variance weighting proposed in this 
paper relative to other methods is similar to the intuition for why weighting by the inverse 
of the variance matrix in the GMM objective function improves the asymptotic variance of 
GMM estimators. Moments that can be estimated more accurately should be given more 
weight. However, as I describe in more detail in the body of the paper, the result is also 
related to the choice of bandwidth in kernel estimation. The KS statistics for moment inequality
models I consider take the supremum of an infinite number of unconditional moment
inequalities that together are equivalent to the conditional moment inequality. Under the 
conditions in this paper, local alternatives violate a sequence of unconditional moments that 
behave like means of kernel functions under a decreasing sequence of bandwidths. Weighting 
by the inverse of the variance allows the KS statistic to automatically choose the unconditional
moments that correspond to the optimal bandwidth, while controlling the probability
of type I error even when smoothness conditions needed for kernel estimation do not hold. 

One interpretation of this result is that inverse variance weighting results in a test that is 
adaptive to smoothness conditions on the conditional mean. Indeed, the rates of convergence 
to the identified set derived in this paper coincide with the optimal rates of convergence for 
estimates of conditional means under a Lipschitz condition or a bounded second derivative
derived in Stone (1982). The confidence regions proposed in this paper are also adaptive to
Hölder conditions and intermediate levels of smoothness. Thus, this paper draws a connection
between optimal weighting functions and adaptive estimation.

Another way of describing the intuition for the better rate of convergence with variance 
weighting is that it helps alleviate a nonsimilarity problem with KS statistics applied to
conditional moment inequality problems. As shown by Armstrong (2011), KS statistics
with bounded weights will converge at different rates on the boundary of the identified set
depending on the shape of the conditional mean. The results from that paper can be used to
improve the power of tests based on bounded weights, but require pretests to determine the
rate of convergence of the test statistic. The weight functions I propose in this paper scale 
up low variance moments so that the KS statistic will be of the same order of magnitude 
whether the supremum is achieved at a low or high variance moment. This makes the 
procedures proposed in this paper more powerful against sequences of alternative parameter 
values that determine rates of convergence in the Hausdorff metric, leading to a faster rate 
of convergence for the confidence region even when a worst case critical value is used. 

The results in this paper show that, in certain smoothness classes, confidence regions 
based on the methods in this paper achieve the best rate of convergence to the identified 
set in the Hausdorff metric. While other methods achieve the same rate of convergence if 
prior information is known about the shape of the conditional mean, these methods will do 
much worse if incorrect prior information is used to choose a different approach. A succinct 
way of putting this is that, among the approaches considered here, the approach based on 
inverse variance weighted KS statistics has the optimal minimax rate for a broad set of 
smoothness classes. While minimax definitions of relative efficiency are useful, they ignore 
the possibility that, while the inverse variance weighting approach is better in the worst case 
in a particular class of distributions, other approaches might do much better under more 
favorable data generating processes. However, the results in Section 6 show that, even in
a very restrictive set of cases that are more favorable for the approach based on bounded 
weights, the inverse variance weighting proposed in this paper will only lose a log n term in 
the rate of convergence to the identified set relative to the rate of convergence using bounded 
weights. This contrasts with the polynomial differences in rates of convergence in cases where 
bounded weights or kernel based methods do worse. 



The sets considered in this paper are confidence sets in the sense of Chernozhukov, Hong, and Tamer
(2007), since they contain the identified set with a prespecified probability asymptotically.
One can also interpret these sets as outwardly biased estimates of the identified set, similar
to those proposed by Chernozhukov, Hong, and Tamer. Throughout the paper, I refer
to these sets interchangeably as confidence regions and as estimates of the identified set.
Interpreting these sets as confidence regions, the rates of convergence in the Hausdorff metric 
derived in this paper are a measure of the power of these tests against local alternatives. 
The rates of convergence derived here imply local power results for sequences of parameter 
values that approach the boundary of the identified set. In addition to the confidence regions
considered here that contain the entire identified set, methods similar to those used
in this paper could be used to construct confidence regions for points in the identified set,
as proposed by Imbens and Manski (2004). Local power results for tests satisfying this less
stringent requirement would follow from similar arguments.

The new class of weightings proposed in this paper leads to a nontrivial change in the
behavior of the statistic. Whereas the KS type statistics considered by Andrews and Shi
(2009) and Kim (2008) are defined as the supremum of a random process that converges
to a tight random process, this does not hold with the increasing truncation points for the
inverse variance weighting used here. Thus, while the statistics using bounded weights can
be handled using functional central limit theorems in the supremum norm, such as those in
van der Vaart and Wellner (1996), such results do not apply for the weighting functions
in this paper. To overcome this, I use maximal inequalities that bound the supremum
of a random process by a function of the maximal variance of the process. The asymptotic
bounds on the sampling distribution of the statistic with the new weighting follow arguments
in Pollard (1984), with some slight modifications to obtain uniformity in the underlying
distribution. A disadvantage of this approach is that it only leads to an upper bound on the
critical value for the test statistic, leading to conservative inference. While this is also the 
case for many procedures in the moment inequalities setting, it would be useful to extend 
these results to derive less conservative critical values. On the other hand, the local power 
results in this paper show that, even with these conservative critical values, confidence sets 
based on the weighting proposed in this paper converge to the identified set at a faster rate 
than confidence regions based on bounded weightings. 

This paper relates to the recent literature on econometric models defined by moment
inequalities and, in particular, conditional moment inequalities where the conditioning variable
is continuously distributed. Andrews and Shi (2009), Kim (2008), Menzel (2008, 2010) and
Chernozhukov, Lee, and Rosen (2009) treat this problem in different ways. The estimators
of the identified set considered in the present paper are most similar to those considered by
Andrews and Shi (2009) and Kim (2008), the only major difference being the magnitude of
a truncation parameter relative to the sample size. One of the contributions of this paper is
to show how allowing the truncation parameter to change with the sample size changes the 
behavior of the KS statistic in nontrivial ways, and how to use this to form set estimates 
that, in a broad class of models, converge to the identified set at a faster rate. In addition,
the rates of convergence to the identified set for some of these approaches derived in
the present paper are the first local power results for these methods that apply generically 
to conditional moment inequality models in the set identified case. These estimators and
inference procedures build on the idea of transforming conditional moment inequalities to
unconditional moment inequalities, which was used by Khan and Tamer (2009) to propose
estimates for a point identified model. Their setting differs from most of those considered
here in that their model is point identified with a root-n rate of convergence for the point
estimate. Galichon and Henry (2009) propose a similar statistic for a class of models under
a different setup with possible lack of point identification.

More broadly, this paper relates to the literature on set identified models. Much of this 



that treat this problem include Andrews, Berry, and Ji
(2008):

The paper is organized as follows. In Section 2, I describe the estimation problem and
estimators of the identified set, and give an informal description of some of the results in
the paper and the intuition behind them. In Section 3, I state conditions under which the
estimate contains the identified set with probability approaching one. In Section 4, I state
conditions for consistency and rates of convergence. In Section 5, I verify the conditions of
Section 4 in some examples. In Section 6, I derive rates of convergence of other estimators of
the identified set and compare them to rates of convergence for the estimators proposed in
this paper. Section 7 reports the results of a Monte Carlo study of the finite sample properties
of the estimators. Section 8 concludes, and an appendix contains proofs and additional results
referred to in the body of the paper.

I use the following notation throughout the paper. For observations $(X_1, W_1), \ldots, (X_n, W_n)$
and a measurable function $h$ on the sample space, $E_n h(X_i, W_i) = \frac{1}{n} \sum_{i=1}^n h(X_i, W_i)$ denotes
the sample mean and $E_P h(X_i, W_i)$ denotes the mean of $h(X_i, W_i)$ under the probability
measure $P$. The support of a random variable $X_i$ under a probability measure $P$ is denoted
$\operatorname{supp}_P(X_i)$. I use double subscripts to denote elements of vector observations so that $X_{i,j}$
denotes the $j$th component of the $i$th observation $X_i$. For a vector $x \in \mathbb{R}^k$, I use the notation
$x_{-i}$ to denote the vector $(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_k)'$. Inequalities on Euclidean space refer to
the partial ordering of elementwise inequality. I use $a \wedge b$ to denote the elementwise minimum
and $a \vee b$ to denote the elementwise maximum of $a$ and $b$. For a norm $\|\cdot\|$ on $\mathbb{R}^k$,
$\|t\|_- = \|t \wedge 0\|$. Unless otherwise noted, $\|\cdot\|$ denotes the Euclidean norm.

2 Setup and Informal Description of Results



We observe iid observations $(X_1, W_1), \ldots, (X_n, W_n)$ distributed according to some probability
distribution $P$ and wish to perform inference on the identified set $\Theta_0(P)$ of parameters
$\theta \in \Theta \subseteq \mathbb{R}^{d_\theta}$ that satisfy the conditional moment inequalities

$$E_P[m(W_i,\theta)|X_i] \ge 0 \quad P\text{-a.s.}$$

Here, $X_i$ and $W_i$ are random variables on $\mathbb{R}^{d_X}$ and $\mathbb{R}^{d_W}$ respectively, and
$m : \mathbb{R}^{d_W} \times \Theta \to \mathbb{R}^{d_Y}$
is a measurable function. See Section 5 for examples of econometric models that fit into this
framework. In what follows, $\bar m(\theta,x,P)$ will denote a version of $E_P[m(W_i,\theta)|X_i = x]$.

I consider inference on $\Theta_0(P)$ using a standard deviation weighted KS statistic defined as
follows. Let $\mathcal{G}$ be a class of functions from $\mathbb{R}^{d_X}$ to $\mathbb{R}_+^{d_Y}$. Let
$\mu_{P,j}(\theta,g) = E_P m_j(W_i,\theta)g_j(X_i)$ and
$\sigma_{P,j}(\theta,g) = \{E_P[(m_j(W_i,\theta)g_j(X_i))^2] - [E_P m_j(W_i,\theta)g_j(X_i)]^2\}^{1/2}$,
and define the sample analogues $\hat\mu_{n,j}(\theta,g) = E_n m_j(W_i,\theta)g_j(X_i)$ and
$\hat\sigma_{n,j}(\theta,g) = \{E_n[(m_j(W_i,\theta)g_j(X_i))^2] - [E_n m_j(W_i,\theta)g_j(X_i)]^2\}^{1/2}$.
Since the functions in $\mathcal{G}$ are nonnegative, $E_P[m(W_i,\theta)|X_i = x] \ge 0$
for all $x$ implies that $\mu_{P,j}(\theta,g) = E_P m_j(W_i,\theta)g_j(X_i)$ is nonnegative for all $g$ and $j$. The KS
statistics in this paper are designed to be positive and large in magnitude when one of these
moments is small (negative and large in magnitude). For a fixed function $S : \mathbb{R}^{d_Y} \to \mathbb{R}_+$
chosen by the researcher, the KS statistic is defined as

$$T_n(\theta) = \sup_{g \in \mathcal{G}} S\left( \frac{\hat\mu_{n,1}(\theta,g)}{\hat\sigma_{n,1}(\theta,g) \vee \sigma_n}, \ldots, \frac{\hat\mu_{n,d_Y}(\theta,g)}{\hat\sigma_{n,d_Y}(\theta,g) \vee \sigma_n} \right)$$

where $\sigma_n$ is a sequence of truncation points. Here, $S$ is a function that is positive and large
in magnitude when one of its arguments is negative and large in magnitude. Possible choices
include $t \mapsto \|t\|_-$ or, more generally, any function that satisfies Assumption 3.4 given in
Section 3. If $T_n(\theta)$ is positive and large in magnitude, this is evidence that $\mu_{P,j}(\theta,g)$ is
negative for some $j$ and $g$, so that $\theta$ is not in the identified set.
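
As a rough illustration of the definition above, the following Python sketch computes $T_n(\theta)$ at a single parameter value. The function name `ks_statistic`, the array layout, the replacement of the supremum over $\mathcal{G}$ by a maximum over a finite collection of functions, and the particular choice $S(t) = \max_j |t_j \wedge 0|$ are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def ks_statistic(m_vals, g_vals, sigma_n):
    """Sketch of the truncated inverse-standard-deviation weighted KS statistic.

    m_vals  : (n, d_Y) array holding m_j(W_i, theta) at one fixed theta.
    g_vals  : (G, n, d_Y) array holding g_j(X_i) for G nonnegative functions g
              (a finite stand-in for the class of functions; an assumption).
    sigma_n : truncation point for the sample standard deviation.
    """
    prod = g_vals * m_vals[None, :, :]        # m_j(W_i, theta) * g_j(X_i)
    mu_hat = prod.mean(axis=1)                # (G, d_Y) sample means
    sd_hat = prod.std(axis=1)                 # (G, d_Y), divisor n (ddof=0)
    ratio = mu_hat / np.maximum(sd_hat, sigma_n)
    # Assumed choice of S: sup-norm of the negative part of its argument.
    s_vals = np.max(np.abs(np.minimum(ratio, 0.0)), axis=1)
    return float(s_vals.max())
```

A large value of `ks_statistic` indicates that some weighted sample moment is strongly negative; a value of zero means that no sample moment in the finite collection is negative.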

The set estimates in this paper invert this test statistic using critical values that control
the probability of false rejection uniformly over $\Theta$, as proposed by Chernozhukov, Hong, and Tamer
(2007). For some data dependent value $\hat c_n$, the confidence region $\mathcal{C}_n(\hat c_n)$ for the identified set
is defined as

$$\mathcal{C}_n(\hat c_n) = \left\{ \theta \in \Theta \,\middle|\, \sqrt{\frac{n}{\log n}}\, T_n(\theta) \le \hat c_n \right\}.$$

Defining the critical value relative to the $\sqrt{n/\log n}$ scaling anticipates results on the rate of convergence
of $T_n(\theta)$ stated in what follows.
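
The inversion step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the parameter space is replaced by a hypothetical finite grid, the statistic is supplied as a callable, and the critical value is taken as given rather than computed by the methods of the paper. The names `confidence_region`, `stat_fn`, and `toy_stat` are made up for this sketch.

```python
import math


def confidence_region(theta_grid, stat_fn, n, c_hat):
    """Return the grid points with sqrt(n / log n) * T_n(theta) <= c_hat.

    theta_grid : iterable of candidate parameter values (hypothetical grid).
    stat_fn    : callable returning T_n(theta) for one candidate value.
    n          : sample size (n >= 2 so that log n > 0).
    c_hat      : critical value, taken as given here.
    """
    scale = math.sqrt(n / math.log(n))
    return [theta for theta in theta_grid if scale * stat_fn(theta) <= c_hat]


# Toy check with a made-up statistic that is zero for theta <= 1.
toy_stat = lambda theta: max(theta - 1.0, 0.0)
region = confidence_region([0.0, 0.5, 1.0, 1.5, 2.0], toy_stat, n=100, c_hat=1.0)
```

In the toy check, the grid points 1.5 and 2.0 are excluded because the scaled statistic exceeds the critical value there.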

2.1 Intuition for the Results 

To describe the intuition behind the results in this paper, consider a special case of the 
KS statistic based confidence regions I treat in this paper applied to a particular model. 
Consider an interval regression model, in which we posit a linear conditional mean for a 
latent variable $W_i^*$ given an observed variable $X_i$, $E_P(W_i^*|X_i) = \theta_1 + X_i'\theta_{-1}$, but only observe
intervals known to contain $W_i^*$. Here, $X_i$ is a continuously distributed random variable on
$\mathbb{R}^{d_X}$. While surveys that elicit interval responses are an obvious application, this encompasses
other forms of incomplete data including selection models and missing data (see Section 5.5
for an example). I give a more thorough treatment of this model in Sections 5.1 and 5.2. To
keep things simple, suppose that we only observe a one sided interval containing $W_i^*$. That
is, we observe a variable $W_i^H$ known to be greater than or equal to $W_i^*$. Then the problem
can be defined formally as estimating or performing inference on the identified set $\Theta_0(P)$ of
values of $\theta = (\theta_1, \theta_{-1})$ that satisfy $E_P(W_i^H|X_i) \ge \theta_1 + X_i'\theta_{-1}$.

To fix ideas, consider using the KS statistic defined above with the class of functions $\mathcal{G}$
given by the set of indicator functions $x \mapsto I(\|x - s\| < h)$ with $s$ ranging over $\mathbb{R}^{d_X}$ and $h$
ranging over nonnegative reals. The results in this paper allow other classes of functions for
$\mathcal{G}$ including other kernel functions, but this example captures the main ideas. For some positive
weighting function $\omega(\theta,s,h)$, define the KS statistic

$$T_{n,\omega}(\theta) = \sup_{(s,h)} \left| \omega(\theta,s,h)\, E_n (W_i^H - \theta_1 - X_i'\theta_{-1}) I(\|X_i - s\| < h) \right|_-$$

where $|r|_- = |r \wedge 0|$. This corresponds to the KS statistic
defined above with $S(r) = |r|_-$ and with the weight function $\frac{1}{\hat\sigma(\theta,s,h) \vee \sigma_n}$ (here
$\hat\sigma(\theta,s,h) = \{E_n[((W_i^H - \theta_1 - X_i'\theta_{-1}) I(\|X_i - s\| < h))^2] - [E_n (W_i^H - \theta_1 - X_i'\theta_{-1}) I(\|X_i - s\| < h)]^2\}^{1/2}$)
replaced by an arbitrary weight function $\omega(\theta,s,h)$. I derive rates of convergence for set
estimates based on the truncated variance weight function $\omega(\theta,s,h) = \frac{1}{\hat\sigma(\theta,s,h) \vee \sigma_n}$ in Section 4. In Section
6, I derive rates of convergence to the identified set for estimators based on KS statistics with
$\omega$ given by a function that is bounded uniformly in the sample size $n$. In the remainder of
this section, I state these results informally and describe some of the intuition behind them.

Following Andrews and Shi (2009) and Kim (2008), one can show that $T_{n,\omega}(\theta)$ will converge
at a $\sqrt{n}$ rate under regularity conditions if $\omega(\theta,s,h)$ is bounded uniformly in $n$. However,
since the variance of the moment indexed by $(\theta,s,h)$ will be arbitrarily small when
$h$ is small ($X_i$ has a continuous distribution), setting $\omega(\theta,s,h)$ equal to $\frac{1}{\hat\sigma(\theta,s,h) \vee \sigma_n}$ gives a
weight function that increases without bound as $\sigma_n$ decreases with the sample size. This
decreases the rate of convergence from $\sqrt{n}$ to $\sqrt{n/\log n}$ in general. The estimators of the
identified set I propose in this paper are based on inverting KS tests with this weighting
function, where $\sqrt{n/\log n}\, T_{n,\omega}(\theta)$ is compared to a critical value $\hat c_n$ that is bounded or increases
slowly. With a bounded weight function that does not increase with $n$, $\sqrt{n}\, T_{n,\omega}(\theta)$ is
compared to a bounded or slowly increasing critical value.
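
To make the two weightings concrete, here is a minimal Python sketch of $T_{n,\omega}(\theta)$ for the one sided interval regression with a scalar regressor. It is an illustration under assumptions that are not in the paper: the supremum over $(s,h)$ is replaced by a maximum over finite lists `centers` and `bandwidths`, the bounded weight is taken to be $\omega = 1$, and the function name `interval_ks` is hypothetical.

```python
import numpy as np


def interval_ks(theta, x, w_high, centers, bandwidths, sigma_n=None):
    """Sketch of T_{n,omega}(theta) with g(x) = 1{|x - s| < h} and scalar x.

    theta   : pair (theta_1, theta_{-1}); intercept and slope.
    x       : (n,) array of regressors X_i.
    w_high  : (n,) array of upper interval endpoints W_i^H.
    sigma_n : None selects the bounded weight omega = 1 (an assumed choice);
              otherwise omega = 1 / max(sigma_hat(theta, s, h), sigma_n).
    """
    resid = w_high - theta[0] - theta[1] * x
    stat = 0.0
    for s in centers:
        for h in bandwidths:
            prod = resid * (np.abs(x - s) < h)   # moment function times indicator
            if sigma_n is None:
                weight = 1.0
            else:
                weight = 1.0 / max(prod.std(), sigma_n)
            # |r|_- = |min(r, 0)| applied to the weighted sample moment
            stat = max(stat, abs(min(weight * prod.mean(), 0.0)))
    return stat
```

With `sigma_n=None` the statistic would be compared to a critical value after scaling by $\sqrt{n}$, and with a truncation point after scaling by $\sqrt{n/\log n}$, as described in the text.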

In this paper, I consider rates of convergence of these confidence regions to the identified 
set. While power against a fixed sequence of local alternatives is a bit different than rates of 
convergence to the identified set (see the discussion at the end of Section 5.1, the conditions
in Section 5.2, and the example in Section A.3 of the appendix for some of the issues that
arise in going from sequences of local alternatives to rates of convergence to the identified
set), much of the intuition for the results in this paper can be exposited in the context of
a single sequence of local alternatives. Consider a value of $\theta$ such that the regression line
$\theta_1 + X_i'\theta_{-1}$ is tangent to the conditional mean $E_P(W_i^H|X_i)$ at a single point $x_0$, and $X_i$ has a
density bounded away from zero and infinity near $x_0$. This will typically be the case at least
for some, if not all, elements on the boundary of the identified set. The results are the same
if $x_0$ is replaced by a finite set, and can be extended to cases of set identification at infinity
or at a finite boundary in which $x_0$ may be infinite and the density of $X_i$ may go to zero or
infinity near $x_0$ by transforming the model (see Section 5.5). Suppose that, for some $\alpha > 0$,

$$E_P(W_i^H - \theta_1 - X_i'\theta_{-1}|X_i = x) \text{ increases like } \|x - x_0\|^\alpha \qquad (3)$$

as $\|x - x_0\|$ increases for $x$ close to $x_0$. If $E_P(W_i^H|X_i = x)$ is twice differentiable and $x_0$ is
on the interior of the support of $X_i$, this will hold with $\alpha = 2$, and a Lipschitz condition on
$E_P(W_i^H|X_i = x)$ leads to $\alpha = 1$. While other values of $\alpha$ appear less natural in this context,
they are common in irregularly identified cases such as the selection model considered in
Section 5.5.

Consider the power of KS tests against local alternatives of the form $\theta_n = (\theta_{1,0} + a_n, \theta_{-1,0})$,
where $\theta_0 = (\theta_{1,0}, \theta_{-1,0})$ is on the boundary of the identified set and satisfies the above
conditions for some $\alpha$. Since moments centered at $x_0$ will have more negative expected
values under this sequence of alternatives, the moments with the most power for detecting
this sequence of local alternatives will be those indexed by $s = x_0$ and some sequence of
values of $h$. For both classes of weight functions, the order of magnitude of the value of
$h$ that indexes the moment with the most power will be determined by a tradeoff between
variance and the magnitude of the expectation. The KS objective function evaluated at
some $(\theta, s, h)$ is the sum of a mean zero term $(E_n - E_P)(W_i^H - \theta_1 - X_i'\theta_{-1}) I(\|X_i - s\| < h)$
and a drift term $E_P(W_i^H - \theta_1 - X_i'\theta_{-1}) I(\|X_i - s\| < h)$. Under $(\theta_n, s, h)$ with $s = x_0$, the
drift term is

$$\begin{aligned}
E_P(W_i^H - \theta_{1,n} - X_i'\theta_{-1,n}) I(\|X_i - x_0\| < h)
&= E_P(W_i^H - \theta_{1,0} - a_n - X_i'\theta_{-1,0}) I(\|X_i - x_0\| < h) \\
&= E_P(W_i^H - \theta_{1,0} - X_i'\theta_{-1,0}) I(\|X_i - x_0\| < h) - a_n E_P I(\|X_i - x_0\| < h). \qquad (4)
\end{aligned}$$

Some calculation shows that the first term in the above display is of order $h^{\alpha + d_X}$, while the
second term in the above display is of order $-a_n h^{d_X}$.
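
The orders of the two terms can be checked with a short calculation. The sketch below assumes, beyond what is stated in the text, that condition (3) holds in the two-sided sense that the conditional mean of $W_i^H - \theta_{1,0} - X_i'\theta_{-1,0}$ given $X_i = x$ is bounded above and below by constant multiples of $\|x - x_0\|^\alpha$ near $x_0$. The density notation $f_X$ is introduced here for illustration, and $\asymp$ means "bounded above and below by constant multiples", which uses the assumption that the density of $X_i$ is bounded away from zero and infinity near $x_0$.

```latex
\begin{aligned}
E_P(W_i^H - \theta_{1,0} - X_i'\theta_{-1,0})\, I(\|X_i - x_0\| < h)
  &= \int_{\|x - x_0\| < h}
     E_P(W_i^H - \theta_{1,0} - X_i'\theta_{-1,0} \mid X_i = x)\, f_X(x)\, dx \\
  &\asymp \int_{\|x - x_0\| < h} \|x - x_0\|^{\alpha}\, dx
   \;\propto\; \int_0^h r^{\alpha}\, r^{d_X - 1}\, dr
   = \frac{h^{\alpha + d_X}}{\alpha + d_X}, \\
E_P I(\|X_i - x_0\| < h)
  &= \int_{\|x - x_0\| < h} f_X(x)\, dx \;\asymp\; h^{d_X}.
\end{aligned}
```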

Which values of $h$ result in the corresponding moment having power depends on the
mean zero term and the scaling, which depends on the weight function. First, consider the
increasing sequence of weight functions given by $\omega(\theta_n, x_0, h) = \frac{1}{\hat\sigma(\theta_n, x_0, h) \vee \sigma_n}$. In this case,
the $O(h^{\alpha + d_X} - a_n h^{d_X})$ term in the above display will be divided by $\hat\sigma(\theta_n, x_0, h) \vee \sigma_n$, which,
for $\sigma_n$ small enough, will be approximately equal to the standard deviation of the moment
indexed by $(\theta_n, x_0, h)$, which is of order $h^{d_X/2}$, and compared to a critical value that is of order
$(n/\log n)^{-1/2}$ (the mean zero term will be of the same order of magnitude as the normalized
critical value, so it will not affect the power calculation). Thus, the local alternative indexed
by $a_n$ will be detected if $O\left(\frac{h^{\alpha + d_X} - a_n h^{d_X}}{h^{d_X/2}}\right) < -O(n/\log n)^{-1/2}$ for some $h$. The left hand side
is minimized when $h$ is equal to a small constant times $a_n^{1/\alpha}$, which leads to the left hand side
being of order $-a_n^{(d_X + 2\alpha)/(2\alpha)}$. This will be less than the $-O(n/\log n)^{-1/2}$ critical value if $a_n$
is greater than or equal to a large enough constant times $(n/\log n)^{-\alpha/(d_X + 2\alpha)}$. An argument
that formalizes these ideas and adapts them to derive rates of convergence to the identified
set rather than power against fixed sequences shows that this is the rate of convergence of
set estimates based on KS statistics with the truncated inverse variance weight function I
propose in this paper under more general conditions that include this model as a special
case.
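
The choice of $h$ in this argument can be made explicit. As a sketch, treat the $O(\cdot)$ terms as exact up to constants (an assumption made only for illustration) and substitute $h = c\, a_n^{1/\alpha}$ with a hypothetical constant $0 < c < 1$; the constant $K > 0$ below stands in for the constant in the critical value.

```latex
\begin{aligned}
\frac{h^{\alpha + d_X} - a_n h^{d_X}}{h^{d_X/2}}
  &= h^{\alpha + d_X/2} - a_n h^{d_X/2} \\
  &= \left(c^{\alpha + d_X/2} - c^{d_X/2}\right) a_n^{1 + d_X/(2\alpha)}
   = -c^{d_X/2}\left(1 - c^{\alpha}\right) a_n^{(d_X + 2\alpha)/(2\alpha)}, \\
a_n^{(d_X + 2\alpha)/(2\alpha)} \ge K \left(\frac{n}{\log n}\right)^{-1/2}
  &\iff
  a_n \ge K^{2\alpha/(d_X + 2\alpha)} \left(\frac{n}{\log n}\right)^{-\alpha/(d_X + 2\alpha)}.
\end{aligned}
```

The first line shows that the normalized drift is negative and of order $a_n^{(d_X + 2\alpha)/(2\alpha)}$ for this choice of $h$, and the second line converts the detection condition into the stated lower bound on $a_n$.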

Now consider using a KS statistic with a bounded weight function. The drift term will
still be of order $h^{\alpha + d_X} - a_n h^{d_X}$ before being multiplied by the weight function, but, since the
weight function is bounded uniformly in $n$, weighting will not increase the order of magnitude
of the drift term. In this case, the KS statistics will be compared to a critical value of order
$n^{-1/2}$ and the mean zero term will be of a smaller order of magnitude, so that the local
alternative indexed by $a_n$ will be detected if $O(h^{\alpha + d_X} - a_n h^{d_X}) < -O(n^{-1/2})$ for some $h$. As before,
the left hand side is minimized when $h$ is equal to some small constant times $a_n^{1/\alpha}$. In this
case, this leads to the left hand side being of order $-a_n^{(d_X + \alpha)/\alpha}$. This will be less than the
$-O(n^{-1/2})$ critical value if $a_n$ is greater than some large constant times $n^{-\alpha/(2d_X + 2\alpha)}$, which
is a slower rate of convergence than the $(n/\log n)^{-\alpha/(d_X + 2\alpha)}$ rate for estimators that use the
inverse variance weighting with increasing truncation points.
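
The two rates can be compared numerically. The short Python sketch below tabulates the exponents $\alpha/(d_X + 2\alpha)$ and $\alpha/(2d_X + 2\alpha)$ from the heuristic argument above, ignoring the $\log n$ factor; the function name `rate_exponents` and the particular values of $\alpha$ and $d_X$ are chosen here purely for illustration.

```python
from fractions import Fraction


def rate_exponents(alpha, d_x):
    """Exponents r such that a_n is of order n^{-r}, up to log factors.

    alpha : smoothness index from condition (3); an int or a Fraction.
    d_x   : dimension of the conditioning variable; an int.
    Returns (exponent with truncated inverse variance weights,
             exponent with bounded weights).
    """
    weighted = Fraction(alpha, d_x + 2 * alpha)
    bounded = Fraction(alpha, 2 * d_x + 2 * alpha)
    return weighted, bounded


# Exponents for a few illustrative smoothness levels and dimensions.
table = {
    (alpha, d_x): rate_exponents(alpha, d_x)
    for alpha in (1, 2)
    for d_x in (1, 2, 3)
}
```

A larger exponent means faster convergence, so the comparison reduces to checking that the first entry of each pair exceeds the second, which holds whenever $d_X > 0$.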

The increase in power from weighting low variance moments by the inverse of their standard
deviations comes from the fact that local alternatives violate the conditional moment
inequality on a shrinking subset of the support of the conditioning variable. If we require
that the weight be bounded uniformly in $n$, low variance moments cannot be weighted properly
because the inverse of the standard deviation will be greater than the truncation point.
One way of putting this is that the KS statistic chooses the optimal order of magnitude for
the kernel bandwidth by performing a bias-variance tradeoff automatically, and the variance
scaling makes sure that the correct variance is used in making this calculation.

3 Coverage of the Identified Set 

In this section, I state conditions under which the confidence region $\mathcal{C}_n(\hat c_n)$ contains the
identified set $\Theta_0(P)$ with probability approaching one. Under these conditions, these estimates
control the probability of falsely concluding that the data are not consistent with
some parameter value. I show that the probability that the estimate contains the identified
set converges to one uniformly in any class of probability distributions $\mathcal{P}$ that satisfy a set
of assumptions stated below. Since these conditions do not restrict the smoothness of the
conditional mean $\bar m(\theta, x, P)$ or the distribution of the conditioning variable, this shows that
the estimator is robust to many types of data generating processes, at least in the sense of
controlling the probability of type I error. In contrast, rates of convergence derived later in
the paper depend on additional smoothness conditions on the data generating process. Thus,
while we can be reasonably confident rejecting potential parameter values with this method,
the power of the KS statistic based estimates (and the other set estimators considered in
Section 6) will depend on the shape of the data generating process.
I make the following assumptions. 

Assumption 3.1. $g_j(X_i) \ge 0$ $P$-a.s. for $j$ from 1 to $d_Y$ for $g \in \mathcal{G}$ and $P \in \mathcal{P}$.

Assumption 3.1 states that the conditional moment inequalities are integrated against
nonnegative functions, so that going from conditional moment inequalities to unconditional
moment inequalities does not change the sign of the moment inequalities.

Assumption 3.2. For $j$ from 1 to $d_Y$, define the classes of functions
$\mathcal{F}_{j,1} = \{ s\, m_j(W_i,\theta) g_j(X_i) + t \mid \theta \in \Theta, g \in \mathcal{G}, s,t \in [-(\overline{Y} \vee 1), \overline{Y} \vee 1] \}$ and
$\mathcal{F}_{j,2} = \{ (s\, m_j(W_i,\theta) g_j(X_i) + t)^2 \mid \theta \in \Theta, g \in \mathcal{G}, s,t \in [-(\overline{Y} \vee 1), \overline{Y} \vee 1] \}$,
where $\overline{Y}$ is defined in Assumption 3.3 below. Suppose that, for $j$ from 1 to
$d_Y$ and $i = 1,2$, $\sup_Q N(\varepsilon, \mathcal{F}_{j,i}, L_1(Q)) \le A\varepsilon^{-V}$ for $0 < \varepsilon < 1$ for some $A, V > 0$, where the
supremum over $Q$ is over all probability measures and $N(\varepsilon, \mathcal{F}_{j,i}, L_1(Q))$ is the $L_1$ covering
number defined in Pollard (1984).



Assumption 3.3. For some fixed $\overline{Y} > 0$, $|m_j(W_i,\theta) g_j(X_i)| \le \overline{Y}$ $P$-a.s. for $j$ from 1 to $d_Y$
for all $P \in \mathcal{P}$.

Assumption 3.2 bounds the complexity of the classes of functions involved so that
empirical process methods can be used. This condition will hold if the corresponding bounds
hold for $\mathcal{G}$ and $\{w \mapsto m(w,\theta) \mid \theta \in \Theta\}$ individually. In Section A.4 of the appendix, I
state sufficient conditions for Assumption 3.2 and verify them for some classes of functions
$\mathcal{G}$ and the moment functions $m$ from the examples in Section 5. See Pollard (1984)
or van der Vaart and Wellner (1996) for definitions and additional sufficient conditions for
these covering number bounds.

Assumption 3.3 is natural in many cases, such as models defined by quantile restrictions.
In other cases, it restricts some variables to a finite interval. While this is clearly stronger
than just bounding some of the moments of $m_j(W_i,\theta) g_j(X_i)$, when combined with
Assumption 3.2 it leads to rates of convergence that are uniform in $\theta$ and $g$ and in the underlying
distribution with no additional assumptions on the shape of the conditional mean or variance
or the smoothness of the cdfs of the random variables.

I make the following assumption on the function $S$. These assumptions are satisfied by
the function $t \mapsto \|t \wedge 0\|$ for any norm $\|\cdot\|$ on Euclidean space.

Assumption 3.4. $S : \mathbb{R}^{d_Y} \to \mathbb{R}_+$ satisfies (i) $S(t) > 0$ iff $t_j < 0$ for some $j$ and (ii)
for some positive constants $K_{S,1}$ and $K_{S,2}$, we have, for any $c > 0$, $S(t) > c \Rightarrow t_j < -cK_{S,1}$
for some $j$ and $S(t) < c \Rightarrow t_j > -cK_{S,2}$ for all $j$.
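
As a quick numerical illustration of Assumption 3.4, the sketch below evaluates $S(t) = \|t \wedge 0\|$ for the sup norm and the Euclidean norm and checks part (ii) on random draws. The constants used ($K_{S,1} = K_{S,2} = 1$ for the sup norm, and $K_{S,1} = 1/\sqrt{d_Y}$, $K_{S,2} = 1$ for the Euclidean norm) are worked out here for illustration and are not taken from the paper; the function names are hypothetical.

```python
import numpy as np

def S(t, ord=np.inf):
    """S(t) = || min(t, 0) || for the chosen vector norm."""
    return np.linalg.norm(np.minimum(t, 0.0), ord=ord)

def satisfies_part_ii(t, c, k1, k2, ord=np.inf):
    s = S(t, ord)
    ok_large = (s <= c) or (t.min() < -c * k1)   # S(t) > c  implies  some t_j < -c * K1
    ok_small = (s >= c) or (t.min() > -c * k2)   # S(t) < c  implies  all  t_j > -c * K2
    return ok_large and ok_small

rng = np.random.default_rng(0)
d_y = 3
draws = [(rng.standard_normal(d_y), rng.uniform(0.01, 2.0)) for _ in range(1000)]

sup_norm_ok = all(satisfies_part_ii(t, c, 1.0, 1.0, np.inf) for t, c in draws)
euclid_ok = all(satisfies_part_ii(t, c, 1.0 / np.sqrt(d_y), 1.0, 2) for t, c in draws)
```

Both checks pass because the largest negative component of $t$ is bounded above by $S(t)$ and below by $S(t)/\sqrt{d_Y}$ for these two norms.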

Finally, I make the following assumption on the sequence of cutoff values for the weighting 
functions. 

Assumption 3.5. $\sigma_n$ is bounded from above and, for some possibly data dependent value $\hat a_n$,
$\sigma_n \sqrt{n/\log n} \ge \hat a_n$.

This assumption will be invoked with additional assumptions on how $\hat a_n$ is chosen. In all
cases, I will require $\hat a_n$ to be bounded away from zero, but some of the results will require
stronger conditions on $\hat a_n$.

Under these conditions with $\hat a_n$ and $\hat c_n$ chosen large enough, the probability of type I error
(in the sense of the estimate not containing the identified set) converges to zero uniformly
in $P \in \mathcal{P}$. In the following theorem, the constant $K$ that determines how large $\hat a_n$ and $\hat c_n$
must be could in principle be calculated as a function of $P$ using the maximal inequalities
in the proof and then estimated. However, the resulting bounds would be conservative in
most cases. In practice, it may be more sensible to take some data dependent value such as
$\sup_{\theta \in \Theta,\, 1 \le j \le d_Y} \{E_n[m_j(W_i,\theta) - E_n m_j(W_i,\theta)]^2\}^{1/2}$ and multiply it by a sequence going slowly
to infinity such as $\log n$ or $\log\log n$.



Theorem 3.1. Suppose that Assumptions 3.1, 3.2, 3.3, 3.4 and 3.5 hold with $\hat a_n \ge K$ and
$\hat c_n \ge K$ with probability approaching one uniformly in $P \in \mathcal{P}$. If $K$ is larger than some
constant that depends only on $V$ and $\overline{Y}$ in Assumptions 3.2 and 3.3, then

$$\inf_{P \in \mathcal{P}} P\big(\Theta_0(P) \subseteq C_n(\hat c_n)\big) \to 1 \quad \text{as } n \to \infty.$$



The $\sqrt{n/\log n}$ rate of convergence of the KS statistic is slower than the $\sqrt{n}$ rate of
convergence with fixed $\sigma_n$ derived by Andrews and Shi (2009). In Section A.2, I show that the rate
of convergence is strictly slower than $\sqrt{n}$ under conditions that include many cases of
interest. One might try to conclude from this that the procedures proposed by
Andrews and Shi (2009) and Kim (2008) will suffer from type I error with probability
approaching one if the cutoff for the weight function ($1/\sigma_n$ in the notation of this paper)
increases with the sample size. While this would be true if the critical value for these tests
were held fixed, the tests proposed in these papers use estimated critical values that could
increase with the sample size if $\sigma_n$ goes to zero. If the critical values increase fast enough,
these tests will still be valid, but it is not clear from existing results whether they do.
Answering this question would require characterizing the behavior of these critical values
for small $\sigma_n$, and comparing them to rates of convergence for the weighted KS statistic such
as those derived in the present paper. Such an approach would likely build on the ideas in
this paper as well as Andrews and Shi (2009) and Kim (2008), using results on the asymptotic
behavior of the KS statistic with increasing weights that build on those derived in this
paper, and comparing them to new results on the critical values proposed by Andrews and Shi
(2009) and Kim (2008) under increasing weights, which would have to be derived and would
likely require stronger conditions than the ones in this paper. In any case, Theorem 3.1 can
be used to form estimates that contain the identified set with probability approaching one,
and choosing a critical value large enough to satisfy the assumptions of this theorem will
typically not affect the rate of convergence. This is the approach I take throughout the rest
of the paper.



4 Consistency and Rates of Convergence 

To get consistency and rates of convergence, we need additional assumptions that lead to
$E_P m(W_i,\theta) g(X_i)$ being large enough for parameters far from the identified set. Consistency
and rate of convergence results are stated for the Hausdorff metric on sets. For a metric $d$
on $\Theta$, define the Hausdorff distance $d_H(A, B)$ between any two sets $A$ and $B$ by

$$d_H(A, B) = \max\Big\{ \sup_{a \in A} \inf_{b \in B} d(a, b),\ \sup_{b \in B} \inf_{a \in A} d(a, b) \Big\}.$$

Here, I define $d_H$ to be the Hausdorff distance that arises when $d$ is defined to be the
metric associated with the Euclidean norm. Note that under the conditions of Theorem 3.1,
$\Theta_0(P) \subseteq C_n(\hat c_n)$ with probability approaching one uniformly in $P \in \mathcal{P}$. When this holds,
$\sup_{b \in \Theta_0(P)} \inf_{a \in C_n(\hat c_n)} d(a, b) = 0$ so that we just need to bound $\sup_{a \in C_n(\hat c_n)} \inf_{b \in \Theta_0(P)} d(a, b)$.
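
For finite sets of points (for example, grid approximations to $\Theta_0(P)$ and $C_n(\hat c_n)$), the Hausdorff distance under the Euclidean metric can be computed directly from the definition. The sketch below is a minimal illustration; the function name `hausdorff` and the example grids are assumptions made for this example only.

```python
import numpy as np

def hausdorff(A, B):
    """Hausdorff distance between finite point sets A (k, d) and B (l, d)."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    # D[i, j] = Euclidean distance between A[i] and B[j]
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    sup_a = D.min(axis=1).max()   # sup over a in A of inf over b in B
    sup_b = D.min(axis=0).max()   # sup over b in B of inf over a in A
    return max(sup_a, sup_b)

# grid on [0, 1] standing in for the identified set, grid on [-0.1, 1.2] for the estimate
theta0_grid = np.linspace(0.0, 1.0, 101).reshape(-1, 1)
estimate_grid = np.linspace(-0.1, 1.2, 131).reshape(-1, 1)
dist = hausdorff(theta0_grid, estimate_grid)
```

Because the first grid is contained in the second, the first term in the maximum is (numerically) zero and the distance is determined by how far the estimate extends beyond the identified set, here about 0.2.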

4.1 Consistency 

The following assumption states that for $\theta$ bounded away from the identified set, some
moment $E_P m_j(W_i,\theta) g_j(X_i)$ is negative and is bounded away from zero. This assumption is
used to obtain consistency, and is in general stronger than what would be needed for power
against fixed points in $\Theta \setminus \Theta_0(P)$, since consistency in the sense of convergence under some
metric on sets requires that the power against fixed alternatives be uniform in alternatives
bounded away from the identified set in this metric.

Assumption 4.1. For every $\varepsilon > 0$, there exists a $\delta > 0$ such that, for all $P \in \mathcal{P}$,
$d_H(\theta, \Theta_0(P)) > \varepsilon$ implies that there exists a $g \in \mathcal{G}$ such that $E_P m_j(W_i,\theta) g_j(X_i) < -\delta$
for some $j$.



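Theorem 4.1. Suppose that Assumption 4.1 and the assumptions of Theorem 3.1 hold, and
that $\sup_{P \in \mathcal{P}} P(\hat c_n \sqrt{(\log n)/n} > \eta) \to 0$ for all $\eta > 0$. Then, for every $\varepsilon > 0$,

$$\sup_{P \in \mathcal{P}} P\big(d_H(\Theta_0(P), C_n(\hat c_n)) > \varepsilon\big) \to 0.$$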
4.2 Rates of Convergence under High Level Conditions

While the focus of this paper is the interpretable conditions for rates of convergence of
the estimate of the identified set given in Section 4.3, I first present a result using a high
level condition. The derivations of the rates of convergence in Section 4.3 use this result
along with additional arguments relating the variance and expectation of the moments to
the conditions in this section. The conditions in this section also encompass the case where
local alternatives violate the conditional moment inequality on a non-shrinking set, leading
to $\sqrt{n/\log n}$ convergence (such as Assumption 5.11 for one of the applications in Section 5), and
it is instructive to compare the verification of the conditions in this section under these two
types of set identification.

The next assumption is a high level assumption that incorporates both the variance
and expectation of the moments defined by each $g \in \mathcal{G}$. The assumption is similar to the
polynomial minorant condition in Chernozhukov, Hong, and Tamer (2007).



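Assumption 4.2. For some positive constants $C$, $\psi$, $\gamma$, and $\delta$ with $\psi \le 1$, we have, (i) for
all $P \in \mathcal{P}$ and $\theta \in \Theta$ with $d_H(\theta, \Theta_0(P)) < \delta$,

$$\inf_{g,j} \frac{\mu_{P,j}(\theta,g)}{\sigma_{P,j}(\theta,g) \vee d_H(\theta,\Theta_0(P))^{\psi/\gamma}} \le -C\, d_H(\theta,\Theta_0(P))^{1/\gamma}$$

where the infimum is taken over $g \in \mathcal{G}$ and $j \in \{1,\ldots,d_Y\}$ and (ii) $\sigma_n (n/\log n)^{\psi/2}$ is
bounded uniformly in $P$.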

Part (ii) of this assumption states that the cutoff $\sigma_n$ must go to zero fast enough that
the moments with the most identifying power relative to their variance are scaled by their
standard deviation. How small $\sigma_{P,j}(\theta,g)$ can be in the assumption is determined by how fast
$\sigma_n$ goes to zero. If the assumption holds with $\psi$ small so that the infimum in the display
is achieved when $\sigma_{P,j}(\theta,g)$ is large relative to the distance from the identified set, $\sigma_n$ can
be chosen to go to zero more slowly. If part (i) holds for any $\psi$, it will hold for $\psi = 1$, so
that choosing $\sigma_n$ so that part (ii) holds for $\psi = 1$ will lead to the assumption holding in
a larger set of cases when the researcher is unsure which $g$ functions have the most power.
In the cases considered here, this will not affect the rate of convergence, but will have a
negative effect on the tradeoff between power and size when considering power against local
alternatives at a particular rate. In other words, part (i) of Assumption 4.2 is weakest when
$\psi = 1$, so, since $\sigma_n$ can always be chosen to go to zero at a $[(\log n)/n]^{1/2}$ rate so that part (ii)
holds with $\psi = 1$, the researcher can just choose $\sigma_n$ this way to have the rate of convergence
given in the next theorem hold under the weakest possible conditions.

The following theorem gives rates of convergence to the identified set under this assump- 
tion. 



Theorem 4.2. Suppose that Assumptions 4.1 and 4.2 hold, and that Assumptions 3.1, 3.2,
3.3, 3.4 and 3.5 hold with $\hat a_n$ and $\hat c_n$ chosen to satisfy the requirements of Theorems 3.1 and
4.1. Then, for some large $B$ that does not depend on $P$,

The results in the next section use Theorem 4.2 along with additional arguments to
formalize the intuition described in Section 2.1. The balancing of the mean and variance
described in Section 2.1 plays out through the ratio of the mean $\mu_{P,j}(\theta,g)$ and the standard
deviation $\sigma_{P,j}(\theta,g)$ in Assumption 4.2. This determines the best attainable value of $\gamma$ in
Assumption 4.2. If a sequence of $g$ functions can be found such that, as the distance of $\theta$
to the identified set decreases, the magnitude of $\mu_{P,j}(\theta,g)$ decreases much more slowly than
$\sigma_{P,j}(\theta,g)$, the left hand side of the display in Assumption 4.2 will be large in magnitude,
so that the condition will hold with a larger value of $\gamma$. It is useful to contrast this with
the case where local alternatives violate one of the conditional moment inequalities on a
non-shrinking set. In this case, $g$ can be chosen to be some fixed function that is positive
only on this set. This leads to $\sigma_{P,j}(\theta,g)$ being fixed while $\mu_{P,j}(\theta,g)$ typically goes to zero
at a rate proportional to $d_H(\theta,\Theta_0(P))$, so that Assumption 4.2 holds with $\gamma = 1$, and
Theorem 4.2 gives a $\sqrt{n/\log n}$ rate of convergence for the set estimator (see the proof of
the part of Theorem 5.6 that applies under Assumption 5.11 for more details). In cases like
those described in Section 2.1, the best attainable ratio of $\mu_{P,j}(\theta,g)$ to $\sigma_{P,j}(\theta,g)$ depends on
smoothness properties of the data generating process and leads to a smaller $\gamma$ and a slower
rate of convergence. The results in the next section cover this case.



4.3 Interpretable Conditions for Rates of Convergence 

Assumption 4.2 is a high level condition that incorporates both the expectation and variance
of each $g$ function. The next assumptions place restrictions on the shape of the conditional
mean $\bar m(\theta, x, P) = E_P(m(W_i,\theta) | X_i = x)$ as a function of $x$ and $\theta$ that can be used to verify
Assumption 4.2. These conditions shed light on how the shape of the data generating process
and $\bar m(\theta, x, P)$ as a function of $\theta$ and $x$ determine the rate of convergence, and are easier to
verify in many applications. Once consistency is established, these assumptions only need
to hold for $d_H(\theta, \Theta_0(P)) < \varepsilon$ for some $\varepsilon > 0$.

Assumption 4.3. $\bar m(\theta, x, P)$ is differentiable in $\theta$ with derivative $\bar m_\theta(\theta, x, P)$ that is
continuous as a function of $\theta$ uniformly in $(\theta, x, P)$.

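Assumption 4.4. For some $\eta > 0$ and $C > 0$, we have, for all $\theta \in \Theta \setminus \Theta_0(P)$, there exists
a $j_0(\theta,P)$, $\theta_0(\theta,P)$ and $x_0(\theta,P)$ such that

$$\bar m_{\theta, j_0(\theta,P)}(\theta_0(\theta,P), x_0(\theta,P), P)\,(\theta - \theta_0(\theta,P)) \le -\eta \|\theta - \theta_0(\theta,P)\|,$$

$\bar m_{j_0(\theta,P)}(\theta_0(\theta,P), x_0(\theta,P), P) = 0$, and, for $\|x - x_0\| < \eta$,

$$|\bar m_{j_0(\theta,P)}(\theta_0(\theta,P), x, P) - \bar m_{j_0(\theta,P)}(\theta_0(\theta,P), x_0(\theta,P), P)| \le C \|x - x_0(\theta,P)\|^{\alpha}.$$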

The first part of Assumption 4.4 states that, for $\theta$ close to the identified set, there is some
element in the identified set such that moving from this element to $\theta$ corresponds to
moving some index of the conditional mean downward. This assumption restricts the angle
between the path from $\theta$ to some point on the identified set and the directional derivative of
the conditional mean for $\theta$ along this path. To see that the first part of Assumption 4.4 comes
from a condition on the magnitude of the derivative of the conditional mean with respect to
$\theta$ and the angle between the derivative and the difference between $\theta$ and some point on
the identified set, note that, letting $\phi$ be the angle between $\bar m_{\theta, j_0(\theta,P)}(\theta_0(\theta,P), x_0(\theta,P), P)$
and $\theta - \theta_0(\theta,P)$,

$$\bar m_{\theta, j_0(\theta,P)}(\theta_0(\theta,P), x_0(\theta,P), P)\,(\theta - \theta_0(\theta,P))
= \|\bar m_{\theta, j_0(\theta,P)}(\theta_0(\theta,P), x_0(\theta,P), P)\| \, \|\theta - \theta_0(\theta,P)\| \cos\phi.$$

Thus, the first part of Assumption 4.4 will be satisfied if $\|\bar m_{\theta, j_0(\theta,P)}(\theta_0(\theta,P), x_0(\theta,P), P)\|$ is
bounded away from zero and $\cos\phi$ is negative and bounded away from zero.

The second part of Assumption 4.4 is a restriction on the shape of the conditional mean
as a function of $x$ for $\theta$ on the boundary of the identified set. Combining this with the first
part of the assumption determines which functions in $\mathcal{G}$ have power under local alternatives.
As verified for several models in Section 5, this typically follows from Hölder conditions or
conditions on the first two derivatives of conditional means or quantiles of variables in the
data, leading to some value of $\alpha$ between zero and 2, or from conditions on densities and
conditional means near the boundary of the support of the conditioning variable, which can
lead to larger values of $\alpha$ after a transformation of the data.

To better understand how Assumption 4.4 factors into the rate of convergence, it is
helpful to relate it to the discussion in Section 2.1 giving an informal overview of the results
for the interval regression model. The interested reader can consult the proofs of the results
in Sections 5.1 and 5.2 for more details. The second part of Assumption 4.4 is the condition
described in that discussion. The first part of Assumption 4.4 relates to the choice of local alternative
used in Section 2.1. In that section, we fixed a parameter $\theta_0 = (\theta_{1,0}, \theta_{-1,0})$ on the boundary
of the identified set, and considered local alternatives of the form $\theta_n = (\theta_{1,0} + a_n, \theta_{-1,0})$
for some positive sequence $a_n \to 0$. This leads to the characterization of the drift term of
the KS objective function in (4). The same argument goes through for most types of local
alternatives that also vary the slope, but certain types of local alternatives have to be ruled
out. In the interval regression example, these correspond to local alternatives that rotate
the regression line around a single tangency point. For example, in the example in Section
2.1, suppose $d_X = 1$ and $x_0 = 0$. If we instead took a sequence of local alternatives of the
form $\theta'_n = (0, a_n)$, the last line in (1) would instead be

$$E_P(W_i^H - \theta_{1,0} - X_i\theta_{-1,0})\, I(\|X_i - x_0\| < h) - a_n E_P(X_i - x_0)\, I(\|X_i - x_0\| < h).$$

Going through the rest of the argument with $a_n E_P I(\|X_i - x_0\| < h)$ replaced by
$a_n E_P (X_i - x_0) I(\|X_i - x_0\| < h)$ gives a slower rate of convergence because the latter term goes to
zero more quickly as $h$ decreases (see Section A.3 for a more detailed treatment of this
counterexample).

The first part of Assumption 4.4 ensures that these types of sequences of local alternatives
do not determine the rate of convergence. To see how this works, note that applying the
left hand side of the first display of Assumption 4.4 to the interval regression example gives
$\bar m_\theta(\theta_0, x_0, P)(\theta - \theta_0) = -(1, x_0)(\theta - \theta_0)$. Thus, in order for Assumption 4.4 to hold for some $\theta$
and this value of $\theta_0$, $(1, x_0)(\theta - \theta_0)$ must be positive and have the same order of magnitude as
$\|\theta - \theta_0\|$. For $\theta_n$ in the above example, this is $(1, x_0)(\theta_n - \theta_0) = (1, x_0)(a_n, 0)' = a_n = \|\theta_n - \theta_0\|$,
so the first display of Assumption 4.4 holds. For the example with $\theta'_n = (0, a_n)$ (and $x_0 = 0$),
$(1, x_0)(\theta'_n - \theta_0) = (1, 0)(\theta'_n - \theta_0) = (1, 0)(0, a_n)' = 0$, so the first display of Assumption 4.4
does not hold.
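
The two calculations above can be reproduced numerically. The sketch below computes $(1, x_0)(\theta - \theta_0)/\|\theta - \theta_0\|$ for the two sequences of local alternatives in the interval regression example with a scalar regressor; the function name `directional_slope` and the particular values of $\theta_0$ and $a_n$ are chosen here only for illustration.

```python
import numpy as np

def directional_slope(theta, theta0, x0):
    """(1, x0)(theta - theta0) / ||theta - theta0|| in the interval regression example."""
    diff = np.asarray(theta, dtype=float) - np.asarray(theta0, dtype=float)
    grad = np.array([1.0, x0])
    return float(grad @ diff / np.linalg.norm(diff))

theta0 = np.array([0.0, 0.0])
x0 = 0.0
a_n = 0.01

shift = directional_slope(theta0 + np.array([a_n, 0.0]), theta0, x0)    # intercept moves
rotate = directional_slope(theta0 + np.array([0.0, a_n]), theta0, x0)   # line rotates around x0
```

The first ratio equals one for every $a_n$, so it is bounded away from zero as Assumption 4.4 requires, while the second ratio is exactly zero.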

The next assumption states that, for any $P \in \mathcal{P}$, all points must either be outside of the
support of $X_i$ under $P$, or have sufficient probability mass nearby. While this assumption
rules out $X_i$ having infinite support or having a density that goes to zero near the boundary
of its support, these cases can typically be handled by transforming the data to make this
assumption hold. I do this for one application in Section 5.5.

Assumption 4.5. For some $\eta > 0$, we have, for all $P \in \mathcal{P}$ and all $\varepsilon > 0$,
$P(\|X_i - x\| < \varepsilon)/\varepsilon^{d_X} \ge \eta$ for all $x$ on the support of $X_i$.

The next assumption ensures that the set of functions $\mathcal{G}$ is rich enough to contain
functions that behave like indicators of small sets. This assumption holds for any class that
contains indicator functions of the open balls for any norm on $\mathbb{R}^{d_X}$, or, for any nonnegative
bounded kernel function $k : \mathbb{R}^{d_X} \to \mathbb{R}_+$ with finite support and $k(x)$ bounded away from
zero near $x = 0$, the class $\{x \mapsto k((x - t)/h) \mid t \in \mathbb{R}^{d_X}, h > 0\}$ that contains all dilations and
translations of the kernel function $k$.

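Assumption 4.6. The functions in $\mathcal{G}$ are uniformly bounded and for some constants
$0 < C_{\mathcal{G},1} \le 1$ and $0 < C_{\mathcal{G},2}$ we have that, for all $s \in \mathbb{R}^{d_X}$ and $t > 0$, $\mathcal{G}$ contains a function $g$
such that $C_{\mathcal{G},1} I(\|X_i - s\| < C_{\mathcal{G},2}\, t) \le g(X_i) \le I(\|X_i - s\| < t)$.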

The next theorem gives rates of convergence under these assumptions. 

Theorem 4.3. Suppose that Assumptions 4.3, 4.4, 4.5 and 4.6 hold. Then part (i) of
Assumption 4.2 holds with $\gamma = 2\alpha/(d_X + 2\alpha)$ and $\psi = d_X/(d_X + 2\alpha)$.

Applying Theorem 4.2, this gives a $(n/\log n)^{\alpha/(d_X+2\alpha)}$ rate of convergence as long as
the cutoff point $\sigma_n$ for the standard deviation weighting decreases at least as quickly as
$((\log n)/n)^{\psi/2} = (n/\log n)^{-d_X/[2(d_X+2\alpha)]}$, but slightly more slowly than $((\log n)/n)^{1/2}$, so that
Assumption 3.5 will hold with $\hat a_n$ large enough. One choice of $\sigma_n$ that will work regardless of
$\alpha$ is to take some data dependent value like $\sup_{\theta \in \Theta,\, 1 \le j \le d_Y} \{E_n[m_j(W_i,\theta) - E_n m_j(W_i,\theta)]^2\}^{1/2}$
and multiply by $((\log n)/n)^{1/2} b_n$, where $b_n$ is a sequence that goes to infinity more slowly
than any power of $n$ (such as $b_n = \log n$).
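
The exponents in Theorem 4.3 and the data dependent cutoff described above are simple to compute. The following sketch is illustrative only: the function names, the array layout for the moment function evaluated on a grid of parameter values, and the use of a finite grid in place of the supremum over $\Theta$ are assumptions made for this example.

```python
import math
import numpy as np

def rate_exponents(alpha, d_x):
    """Return (gamma, psi, rate), where the set estimator converges at (n / log n)^rate."""
    gamma = 2 * alpha / (d_x + 2 * alpha)
    psi = d_x / (d_x + 2 * alpha)
    return gamma, psi, gamma / 2

def cutoff_sigma_n(m_values, b_n=None):
    """Data dependent cutoff: largest sample sd of m_j times sqrt(log n / n) * b_n.

    m_values : array of shape (n_theta, d_Y, n) holding m_j(W_i, theta) on a grid
               of theta values (assumed layout)
    b_n      : slowly diverging sequence, log n if not supplied
    """
    n = m_values.shape[2]
    scale = m_values.std(axis=2).max()   # sup over the grid and over j of the sample sd
    if b_n is None:
        b_n = math.log(n)
    return scale * math.sqrt(math.log(n) / n) * b_n

gamma, psi, rate = rate_exponents(alpha=2, d_x=1)   # bounded second derivative, scalar regressor

rng = np.random.default_rng(0)
m_values = rng.standard_normal((5, 1, 1000))        # placeholder moment values
sigma_n = cutoff_sigma_n(m_values)
```

With $\alpha = 2$ and $d_X = 1$ this gives $\gamma = 4/5$, $\psi = 1/5$ and a $(n/\log n)^{2/5}$ rate.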



5 Applications 

In this section, I verify the conditions for rates of convergence stated above for some appli- 
cations under primitive conditions. I start with a one sided regression model. 



5.1 One Sided Regression 

We posit a linear regression model $E_P(W_i^* | X_i) = X_i'\beta$ for a latent variable $W_i^*$, but we
only observe $(X_i, W_i^H)$, where $W_i^H$ is known to be greater than or equal to $W_i^*$. This leads
to the conditional moment inequality $E_P(W_i^H | X_i) \ge X_i'\beta$, which fits into the framework
of this paper with $d_Y = 1$, $W_i = (X_i, W_i^H)$ and $m(W_i,\theta) = W_i^H - \theta_1 - X_i'\theta_{-1}$. Here,
$\bar m(\theta, x) = E_P(W_i^H | X_i = x) - \theta_1 - x'\theta_{-1}$. I verify the conditions used above to derive rates
of convergence (Assumptions 4.3 and 4.4) under the following assumptions.

Assumption 5.1. For some $C > 0$ and $0 < \alpha \le 1$, $|E_P(W_i^H | X_i = x) - E_P(W_i^H | X_i = x')| \le C\|x - x'\|^{\alpha}$
for $x$ and $x'$ on the support of $X_i$ for all $P \in \mathcal{P}$.

Assumption 5.1 places a Hölder condition on the conditional mean of the upper bound of
the outcome given $X_i$. This is a smoothness condition on the data generating process. For
$\alpha = 1$, Assumption 5.1 states that this conditional mean must be Lipschitz continuous. For
smaller $\alpha$, the conditional mean must still be continuous, but can be less smooth.

For $\alpha > 1$, a condition like Assumption 5.1 would restrict $E_P(W_i^H | X_i = x)$ to be constant,
since its slope would have to converge to zero at every point. However, as described in
Section 2.1, this condition factors into the rate of convergence only in restricting
$E_P(W_i^H - \theta_1 - X_i'\theta_{-1} | X_i = x)$ to increase no faster than a multiple of $\|x - x_0\|^{\alpha}$ near some tangency
point $x_0$ for $\theta = (\theta_1, \theta_{-1})$ on the boundary of the identified set. The same argument will
still go through as long as this restriction on the difference between $E_P(W_i^H | X_i = x)$ and
a tangent line holds for some $\alpha$, even if $\alpha > 1$. While placing this condition directly on
$E_P(W_i^H - \theta_1 - X_i'\theta_{-1} | X_i = x)$ near tangency points is a bit awkward in general, this condition
has a natural interpretation when $\alpha = 2$. In this case, it requires that the difference between
the conditional mean $E_P(W_i^H | X_i = x)$ and any tangent line behave quadratically near the
tangent point, which is implied by a bound on the second derivative. This is the content of
the next assumption.

Assumption 5.2. (i) $E_P(W_i^H | X_i = x)$ has a second derivative that is bounded uniformly
in $P$ and $x$ and (ii) for any $P \in \mathcal{P}$, $\theta_0 \in \Theta_0(P)$, $E_P(W_i^H | X_i = x)$ is bounded away from
$\theta_{0,1} + x'\theta_{0,-1}$ on the boundary of the support of $X_i$.

The next assumption ensures that the condition on the tangent angle in Assumption 4.4
holds. Under this assumption, rates of convergence to the identified set depend on sequences
of parameters in which only the intercept parameter varies. This condition ensures that
varying the intercept parameter a small amount near the boundary of the identified set gives
an element that is still in the parameter space $\Theta$.

Assumption 5.3. The subvector $\theta_{-1}$ of $\theta$ is bounded over $\theta \in \Theta$ and, for any $\theta \in \Theta$,
$(\theta_1', \theta_{-1}) \in \Theta$ for all $\theta_1' \in \mathbb{R}$.

Theorem 5.1. Suppose that Assumption 5.3 holds in the one sided linear regression model
and $X_i$ has compact support for all $P \in \mathcal{P}$. Then, if Assumption 5.1 holds, Assumptions 4.3
and 4.4 will hold for $\alpha$ specified in Assumption 5.1. If Assumption 5.2 holds, Assumptions
4.3 and 4.4 will hold for $\alpha = 2$.



If the parameter space is restricted so that all sequences of local alternatives correspond
to rotating the regression line around a tangent point, Assumption 5.3 will fail and
the rate of convergence will be slower. The verification of the assumptions of Theorem 4.3
will not go through in this case because the first part of Assumption 4.4 will fail. As an
example, suppose $E_P(W_i^H | X_i = x) = x^2$. If the parameter space $\Theta$ does not restrict the intercept
parameter, the proof of Theorem 5.1 will go through. However, if $\Theta = \{(0, \theta_{-1}) \mid \theta_{-1} \in \mathbb{R}\}$ (that
is, we restrict the intercept to be 0), the rate of convergence will be determined by local
alternatives of the form $(0, a_n)$. This corresponds to the sequence $\theta'_n$ of local alternatives
in the discussion in Section 4.3. For the same reasons described in that section, the first
part of Assumption 4.4 will not hold, leading to a slower rate of convergence. I show in
Section A.3 of the appendix that the estimate of the identified set converges no faster than
at a $((\log n)/n)^{1/5}$ rate, rather than the $((\log n)/n)^{2/5}$ rate for the case where the parameter
space is unrestricted.
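
A heuristic calculation, which is not a substitute for the argument in Section A.3, shows where the two exponents come from. The sketch below assumes $X_i$ is uniform on $[-1, 1]$, $E_P(W_i^H | X_i = x) = x^2$, and noise with unit variance, so that the standard deviation of $m(W_i,\theta) g(X_i)$ is approximately $\sqrt{E_P g(X_i)}$ for an indicator $g$ of a small set. It takes $g$ to be the indicator of the set where the conditional moment inequality is violated and computes the ratio of drift to standard deviation as a function of the size $a$ of the local alternative. All names and design choices in the code are assumptions made for this illustration.

```python
import numpy as np

def drift_to_sd_ratio(a, rotate):
    """Heuristic mu / sigma for E(W^H | X = x) = x^2 and X ~ Uniform(-1, 1)."""
    x = np.linspace(-1.0, 1.0, 400_001)
    if rotate:
        m_bar = x**2 - a * x             # theta = (0, a): slope moves, intercept fixed at 0
        g = (x > 0.0) & (x < a)          # set where the inequality is violated
    else:
        m_bar = x**2 - a                 # theta = (a, 0): intercept moves
        g = np.abs(x) < np.sqrt(a)
    mu = np.mean(m_bar * g)              # grid approximation to E[m_bar(X) g(X)]
    sd = np.sqrt(np.mean(g))             # assumes unit noise variance dominates Var(m g)
    return mu / sd

def loglog_slope(rotate):
    a = np.geomspace(0.01, 0.1, 10)
    r = np.array([abs(drift_to_sd_ratio(ai, rotate)) for ai in a])
    return np.polyfit(np.log(a), np.log(r), 1)[0]

slope_shift = loglog_slope(rotate=False)    # approximately 5/4
slope_rotate = loglog_slope(rotate=True)    # approximately 5/2
```

Under these assumptions the ratio scales like $a^{5/4}$ when the intercept moves and like $a^{5/2}$ when the line rotates. Setting the ratio equal to the order $\sqrt{(\log n)/n}$ of the sampling error gives $a_n$ of order $((\log n)/n)^{2/5}$ in the first case and $((\log n)/n)^{1/5}$ in the second.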

These issues also make it more difficult to state primitive conditions that lead to
Assumption 4.4 in the case of two sided interval regression, in which we add the conditional
moment inequality $m_2(W_i,\theta) = \theta_1 + X_i'\theta_{-1} - W_i^L$. As with restricting the parameter space,
adding the second conditional moment inequality can lead to the rate of convergence being
determined by sequences of local alternatives that correspond to rotating the regression line
around a tangent point. One example that leads to this is when $E_P(W_i^H | X_i = x) = x^2$
and $E_P(W_i^L | X_i = x) = -x^2$. Adding the moment inequality on $W_i^L$ has the same effect as
restricting the intercept to be zero in the example above. The rate of convergence to the
identified set is determined by local alternatives of the form $(0, a_n)$, which leads to a slower
rate of convergence. The argument in Section A.3 applies here as well, leading to a slower
$((\log n)/n)^{1/5}$ rate of convergence.

For the case where $X_i$ is a scalar, these cases can be ruled out in the interval regression
model by requiring that the conditional means of $W_i^H$ and $W_i^L$ be bounded away from each
other. I go through this argument in the next section. However, higher dimensions appear
to require further conditions.



5.2 Interval Regression with a Scalar Regressor

In the case of a single regressor, these types of slow convergence of a slope parameter in the
interval regression model can be ruled out by relatively simple conditions. In what follows, I
consider an interval regression model in which, in addition to $W_i^H$ defined as in Section 5.1,
we observe $W_i^L$ that is known to satisfy $W_i^L \le W_i^*$, so that $E_P(W_i^L | X_i) \le \theta_1 + X_i'\theta_{-1}$. This
fits into the framework of this paper with $m(W_i,\theta) = (W_i^H - \theta_1 - X_i'\theta_{-1},\ \theta_1 + X_i'\theta_{-1} - W_i^L)$.
I restrict attention to the case where $d_X = 1$, so that $\theta_{-1} = \theta_2$ is a scalar.



In addition to the assumptions used in Section 5.1, I impose the following assumption,
which rules out cases like the one described above in which local alternatives correspond to
rotating the regression line around a tangent point.

Assumption 5.4. (i) The support of $X_i$ is bounded uniformly in $P \in \mathcal{P}$. (ii) The absolute
value of the slope parameter $\theta_2$ is bounded uniformly on the identified sets $\Theta_0(P)$ of $P \in \mathcal{P}$.
(iii) $E_P(W_i^H | X_i = x) - E_P(W_i^L | X_i = x)$ is bounded away from zero uniformly in $x$ and $P \in \mathcal{P}$.



Theorem 5.2. In the interval regression model with $d_X = 1$, suppose that Assumption 5.4
holds. Then, if Assumption 5.1 holds as stated and with $W_i^H$ replaced by $W_i^L$, Assumptions
4.3 and 4.4 will hold for $\alpha$ specified in Assumption 5.1 (and $d_X = 1$). If Assumption 5.2
holds as stated and with $W_i^H$ replaced with $W_i^L$, Assumptions 4.3 and 4.4 will hold for $\alpha = 2$
(and $d_X = 1$).



5.3 One Sided Quantile Regression 

In this and the next section, I treat quantile versions of the regression models considered
above. Here, we have a model for a conditional quantile of the unobserved variable $W_i^*$
rather than the mean. The results are essentially the same, but, in addition to smoothness
conditions on the quantile itself, conditions are needed on the joint density of the observed
variables near the conditional quantile to translate these into the conditions on $\bar m(\theta, x, P)$.

First, consider the one sided case in which we observe $(X_i, W_i^H)$ with $W_i^H \ge W_i^*$. For
a random variable $Z_i$, define $q_{\tau,P}(Z_i | X_i)$ to be the $\tau$th quantile of $Z_i$ conditional on $X_i$
under $P$. Suppose that, for some known $\tau$, the conditional $\tau$th quantile of $W_i^*$ satisfies
$q_{\tau,P}(W_i^* | X_i) = \theta_1 + X_i'\theta_{-1}$ for some $\theta$. Then $E_P[\tau - I(W_i^* \le \theta_1 + X_i'\theta_{-1}) | X_i] = 0$ so that
$E_P[\tau - I(W_i^H \le \theta_1 + X_i'\theta_{-1}) | X_i] \ge 0$. Thus, this fits into the framework of this paper with
$W_i = (X_i, W_i^H)$ and $m(W_i,\theta) = \tau - I(W_i^H \le \theta_1 + X_i'\theta_{-1})$.
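
A small simulation illustrates the sign of this moment function at the true parameter. The design below (median regression with a scalar regressor, standard normal errors, and an upper bound that is infinite when the outcome is missing) is an assumption made purely for illustration, as is the function name `quantile_moment`.

```python
import numpy as np

def quantile_moment(w_upper, x, theta, tau):
    """m(W_i, theta) = tau - 1{W_i^H <= theta_1 + x_i * theta_2} for a scalar regressor."""
    return tau - (w_upper <= theta[0] + x * theta[1]).astype(float)

rng = np.random.default_rng(1)
n, tau, theta_true = 20_000, 0.5, (1.0, 2.0)
x = rng.uniform(0.0, 1.0, n)
w_star = theta_true[0] + theta_true[1] * x + rng.standard_normal(n)  # error has median zero
missing = rng.uniform(size=n) < 0.2                                  # missing with probability 0.2 < 1 - tau
w_upper = np.where(missing, np.inf, w_star)                          # upper bound is infinite when missing

m_bar = quantile_moment(w_upper, x, theta_true, tau).mean()
```

In this design the indicator equals one with probability $0.8 \times 0.5 = 0.4$, so the sample mean of the moment function is close to $0.1$ and the inequality holds with slack at the true parameter.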

In many situations, models for quantiles of an outcome variable given covariates can be
more informative under interval data than models for the conditional mean. If $W_i^H$ can
be infinite with positive probability conditional on any value of $X_i$, the identified set for a
conditional mean model will be the entire parameter space. If $W_i^H$ has a low probability of
being large or infinite, and is usually close to $W_i^*$, a model for conditional quantiles of the
unobserved variable will still give informative bounds with interval data.

Smoothness conditions that lead to Assumptions 4.3 and 4.4 for the quantile model are
similar to those for the conditional mean considered above, but with smoothness assumptions
placed on the conditional quantile $q_{\tau,P}(W_i^H | X_i)$ rather than the conditional mean, and
additional assumptions on the joint density of $(X_i, W_i^H)$. The first two assumptions are
exactly the same as Assumptions 5.1 and 5.2, but with the conditional mean replaced by the
conditional $\tau$th quantile.

Assumption 5.5. For some $C > 0$ and $0 < \alpha \le 1$, $|q_{\tau,P}(W_i^H | X_i = x) - q_{\tau,P}(W_i^H | X_i = x')| \le C\|x - x'\|^{\alpha}$
for $x$ and $x'$ on the support of $X_i$ for all $P \in \mathcal{P}$.

Assumption 5.6. (i) $q_{\tau,P}(W_i^H | X_i = x)$ has a second derivative that is bounded uniformly
in $P$ and $x$ and (ii) for any $P \in \mathcal{P}$, $\theta_0 \in \Theta_0(P)$, $q_{\tau,P}(W_i^H | X_i = x)$ is bounded away from
$\theta_{0,1} + x'\theta_{0,-1}$ on the boundary of the support of $X_i$.

The next assumption states that $W_i^H$ has a density near its $\tau$th quantile conditional
on $X_i$. One type of interval data that will frequently lead to this assumption holding is if
$(X_i, W_i^*)$ has a well behaved joint density, and $W_i^H$ is equal to $W_i^*$ with high probability and
much larger than $W_i^*$ with some small probability. For example, suppose that $(X_i, W_i^*)$ has
a joint density, and $W_i^H$ is either equal to $\infty$ or $W_i^*$, with $P(W_i^H = \infty | X_i = x, W_i^* = w)$
a smooth function of $(x, w)$ that is bounded from above by some constant strictly less than
$1 - \tau$. Then $(X_i, W_i^H)$ will have a joint density near the $\tau$th conditional quantile of $W_i^H$.
This type of situation arises naturally with missing data on an outcome variable. However,
other types of interval data will not lead to this assumption holding. If $W_i^H$ is the upper
end of an interval from a survey in which $W_i^*$ is always reported in the same interval, $W_i^H$
will not have a density conditional on $X_i$.

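Assumption 5.7. For some $\eta > 0$, $W_i^H | X_i$ has a conditional density $f_{W_i^H | X_i}(w | x)$ on
$\{(x, w) \mid q_{\tau,P}(W_i^H | X_i = x) - \eta < w < q_{\tau,P}(W_i^H | X_i = x) + \eta\}$ that is continuous as a function
of $w$ uniformly in $(w, x, P)$ and satisfies $\underline{f} \le f_{W_i^H | X_i}(w | x) \le \overline{f}$ for some $0 < \underline{f} < \overline{f} < \infty$.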

Under these conditions, Assumptions 4.3 and 4.4 will hold for the one sided quantile
regression model. The proof is similar to the proof of Theorem 5.1 in the one sided regression
model. The only difference is that some additional steps are needed to translate smoothness
conditions on the $\tau$th quantile into smoothness conditions on the objective function using
the assumptions on the conditional density of $W_i^H$ given $X_i$.

Theorem 5.3. Suppose that the support of $X_i$ is bounded uniformly in $P \in \mathcal{P}$, and that
Assumptions 5.3 and 5.7 hold in the one sided quantile regression model. Then, if Assumption
5.5 holds, Assumptions 4.3 and 4.4 will hold for $\alpha$ specified in Assumption 5.5. If Assumption
5.6 holds, Assumptions 4.3 and 4.4 will hold for $\alpha = 2$.



5.4 Interval Quantile Regression with a Scalar Regressor 

Now consider a quantile regression model with two sided interval data in which, in addition to
$W_i^H$, we observe a variable $W_i^L$ that is known to satisfy $W_i^L \le W_i^*$. This leads to
$E_P[I(W_i^L \le \theta_1 + X_i'\theta_{-1}) - \tau | X_i] \ge E_P[I(W_i^* \le \theta_1 + X_i'\theta_{-1}) - \tau | X_i] = 0$, so that the interval quantile
regression fits into the conditional moment inequality framework with $W_i = (X_i, W_i^L, W_i^H)$
and $m(W_i, \theta) = (\tau - I(W_i^H \le \theta_1 + X_i'\theta_{-1}),\; I(W_i^L \le \theta_1 + X_i'\theta_{-1}) - \tau)$.
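
As a rough illustration of the moment function just defined, the following sketch evaluates $m(W_i, \theta)$ for a single observation with a scalar regressor. The function name and argument names are hypothetical and chosen only for this example; missing outcomes are encoded as $W_i^L = -\infty$ and $W_i^H = \infty$.

```python
import math


def interval_quantile_moments(x, w_lo, w_hi, theta, tau):
    """Evaluate m(W_i, theta) for one observation (scalar regressor).

    theta = (theta_1, theta_2) is the intercept and slope; tau is the quantile.
    Returns (tau - I(w_hi <= fit), I(w_lo <= fit) - tau). Under the model, both
    components have nonnegative conditional mean at parameters in the identified set.
    """
    fit = theta[0] + theta[1] * x
    upper = tau - (1.0 if w_hi <= fit else 0.0)
    lower = (1.0 if w_lo <= fit else 0.0) - tau
    return (upper, lower)


# An observed outcome (w_lo == w_hi) lying below the fitted line.
print(interval_quantile_moments(1.0, 0.5, 0.5, (0.0, 1.0), 0.5))
# A missing outcome, encoded as the interval (-inf, inf).
print(interval_quantile_moments(0.0, -math.inf, math.inf, (0.0, 1.0), 0.5))
```

A missing observation contributes $(\tau, 1 - \tau)$, which is nonnegative for every $\theta$, so it carries no identifying information on its own.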

As with the case of mean regression, the condition on the angle of the derivative and path
in Assumption 4.4 will not hold in general in the quantile regression model with two sided
interval data because of cases where alternatives are closest to a point in the identified set
where the regression line is rotated around a contact point. Sufficient conditions to rule this
out in the case of a scalar regressor are similar as well. Bounding the conditional quantiles
of the upper and lower endpoints of the interval away from each other rules out these cases
when the regressors include only a constant and a scalar. The next assumption is the same
as Assumption 5.4, but with conditional expectations replaced by conditional $\tau$th quantiles.

Assumption 5.8. (i) The support of $X_i$ is bounded uniformly in $P \in \mathcal{P}$. (ii) The absolute
value of the slope parameter $\theta_2$ is bounded uniformly on the identified sets $\Theta_0(P)$ of $P \in \mathcal{P}$.
(iii) $q_{\tau,P}(W_i^H | X_i = x) - q_{\tau,P}(W_i^L | X_i = x)$ is bounded away from zero uniformly in $x$ and $P$.

The next theorem states that KS statistic based set estimators will have the same rate
of convergence as in the one sided model with a scalar regressor under these conditions, and
the assumption stated earlier on the density of the observed variables. The proof is similar
to the proof of the analogous result for mean regression, Theorem 5.2, but with additional
steps to translate conditions on quantiles and densities into conditions on the conditional
mean of the objective function.

Theorem 5.4. In the interval regression example with $d_X = 1$, suppose that Assumptions
5.3 and 5.8 hold, and that Assumption 5.7 also holds with $W_i^H$ replaced by $W_i^L$. Then, if
Assumption 5.5 holds as stated and with $W_i^H$ replaced by $W_i^L$, Assumptions 4.3 and 4.4 will
hold for $\alpha$ specified in Assumption 5.5 (and $d_X = 1$). If Assumption 5.6 holds as stated and
with $W_i^H$ replaced with $W_i^L$, Assumptions 4.3 and 4.4 will hold for $\alpha = 2$ (and $d_X = 1$).

5.5 Selection Model and Identification at the Boundary 

In this section, I treat a class of models in which the conditional moment inequalities give the
most identifying information when conditioning on a set where $X_i$ may not have a density
that is bounded away from zero and infinity. That is, as $\theta$ approaches the identified set, the
moment inequality $E_P(m(W_i, \theta) | X_i = x) \ge 0$ is violated on a region in which the density
of $X_i$ goes to zero or infinity, or in which $X_i$ does not have a density with respect to the
Lebesgue measure. This covers cases of conditional moment inequalities leading to point
or set identification at infinity or at a finite boundary. While I motivate the conditions in
this section with a selection model, the results apply more generally to other cases of set
identification at the boundary.

The selection model is particularly interesting in that it leads naturally to different shapes
of the conditional mean of $m(W_i, \theta)$ and distribution of $X_i$, since set identification at the
boundary of the support of $X_i$ appears to be a common case. For cases where the conditioning
variable has a density function that goes to zero or infinity near a (possibly infinite) support
point, a transformation of the conditioning variable leads to a model for which the smoothness
assumptions for rates of convergence given in this paper can be verified. The resulting value
of the Hölder constant $\alpha$ depends on the shape of both the density and the conditional mean.
This is related to cases of point identification at infinity, such as the estimator proposed by
Andrews and Schafgans (1998) for a selection model similar to the one treated in this section,
but under conditions that lead to point identification. As with the estimator proposed in that
paper, the estimators I consider based on KS statistics for conditional moment inequalities
and possible set identification have rates of convergence that depend on the tail behavior of
the random variables in the model. The behavior of distributions of random variables at the
tails determines which functions in $\mathcal{G}$ correspond to the region of the tail of the conditioning
variable with the most identifying power. The truncated variance weighting I propose allows
the KS statistic to automatically find these functions.

We are interested in the marginal distribution of a random variable $Y_i^*$ but we do not
always observe this variable. Instead, we observe $(Y_i, D_i)$ where $D_i$ is an indicator for being
observed in the sample and $Y_i = Y_i^* \cdot D_i$. For example, suppose we are interested in the
distribution of wage offers for a population of individuals, but we only observe wages of
people who decide to work. In this case, $Y_i^*$ is the wage individual $i$ is offered, and $D_i$ is an
indicator for employment. In what follows, $Y_i$ and $D_i$ are scalars, but the results described
below can be extended to multiple partially observed outcomes. In the treatment effects
literature, potential outcomes under different treatment programs are typically modeled as
latent variables, with the observed variable being the actual treatment. In this case, we can
consider each possible treatment separately, each time defining $Y_i^*$ and $D_i$ to be potential
outcomes and indicators for the treatment group in question. Bounds on the marginal
distribution for each treatment will follow from methods described in this section, and these
bounds can be combined to give bounds on treatment effects defined as differences between
statistics of the unobserved distribution of each outcome.

If $Y_i^*$ is not independent of $D_i$ and $D_i = 0$ with positive probability, the distribution of $Y_i$
will be different from the distribution of $Y_i^*$ conditional on entry. However, it is often possible
to obtain informative bounds. Suppose that we observe a random variable $X_i$ that shifts
participation in the sample, but is exogenous to outcomes in the sense that $Y_i^*$ is independent
of $X_i$. If $Y_i^*$ is known to lie in some interval $[\underline{Y}, \overline{Y}]$, we can bound the distribution of $Y_i^*$
following Manski (1990). In this section, I consider estimation of bounds for the mean of
the distribution of $Y_i^*$, but bounds on quantiles can be estimated using similar methods. For
the same reasons as those described in Section 5.3, bounds on quantiles will often be tighter
than bounds on the mean when the difference between $\underline{Y}$ and $\overline{Y}$ is large or infinite.

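To see how this model fits into the framework of this paper, note that
$Y_i \cdot D_i + \underline{Y} \cdot (1 - D_i) \le Y_i^* \le Y_i \cdot D_i + \overline{Y} \cdot (1 - D_i)$, so that, letting $\gamma = E_P(Y_i^*) = E_P(Y_i^* | X_i)$, we have
$E_P(Y_i \cdot D_i + \underline{Y} \cdot (1 - D_i) | X_i) \le \gamma \le E_P(Y_i \cdot D_i + \overline{Y} \cdot (1 - D_i) | X_i)$. Define $W_i^H = Y_i \cdot D_i + \overline{Y} \cdot (1 - D_i)$
and $W_i^L = Y_i \cdot D_i + \underline{Y} \cdot (1 - D_i)$. The problem of estimating the identified set for $\gamma$ fits into
the framework of this paper with $W_i = (W_i^L, W_i^H, X_i)$ and $m(W_i, \gamma) = (\gamma - W_i^L, W_i^H - \gamma)'$.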
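
The construction of the worst-case bounds and the resulting moment function can be sketched in a few lines. This is only an illustration of the definitions above; the function names and the example values of the outcome bounds are hypothetical.

```python
def selection_bounds(y, d, y_min, y_max):
    """Return (W^L, W^H) for one observation of the selection model.

    y is the observed outcome Y_i = Y_i^* * D_i, d is the selection indicator
    (1 if observed, 0 otherwise), and [y_min, y_max] is the known range of Y_i^*.
    """
    w_lo = y * d + y_min * (1 - d)
    w_hi = y * d + y_max * (1 - d)
    return w_lo, w_hi


def selection_moments(w_lo, w_hi, gamma):
    """Moment function m(W_i, gamma) = (gamma - W^L, W^H - gamma)."""
    return (gamma - w_lo, w_hi - gamma)


# An unobserved outcome is replaced by the endpoints of its known range.
print(selection_bounds(0.0, 0, 0.0, 1.0))
# An observed outcome gives W^L = W^H = Y_i.
print(selection_bounds(0.7, 1, 0.0, 1.0))
```

When participation is almost certain at some value of $X_i$, the two bounds nearly coincide there, which is why such values carry the most identifying information.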

Typically, the best upper and lower bounds on $\gamma$ will come from values of $X_i$ for which
the probability of participation is high. If participation is monotonic, these points will be
near the support of $X_i$. The support of $X_i$ could be infinite or finite, and there is typically
no reason to impose any conditions on how the distribution of $X_i$ behaves near its support
points (whether it has a density, whether the density approaches zero, infinity, a positive
constant, or oscillates wildly) or how $E_P(W_i^H | X_i)$ and $E_P(W_i^L | X_i)$ behave near these points.
In addition, while identification at the boundary of the support seems likely, it is best not
to impose this either.

The results in this section show that estimates of the identified set using weighted KS 
statistics defined above are robust to all of these types of set identification in the sense of 
controlling the probability that the set estimate fails to contain the identified set uniformly
in a set of underlying distributions that contains these types of distributions and many more. 
In addition, for a wide variety of shapes of the density and conditional mean, the weighted 
KS statistic based set estimate obtains a better rate of convergence than estimates that do 
not weight the KS statistic. 



Uniform coverage of the identified set follows immediately from Theorem 3.1 and is
stated in the next theorem. Throughout this section, $\Theta_0(P)$ denotes the identified set for
$\gamma$ in the selection model under $P$, and $C_n(\hat c_n)$ denotes an estimate of this set as described
above.

Theorem 5.5. Let $\mathcal{P}$ be any class of probability measures on the random variables in the
selection model described above such that $W_i^H$ and $W_i^L$ are bounded uniformly over $P \in \mathcal{P}$.
If the class of functions $\mathcal{G}$, the function $S$, and the sequences $\sigma_n$ and $\hat c_n$ are chosen so that
Assumptions 3.1, 3.2, 3.4 and 3.5 hold with $\sigma_n$ and $\hat c_n$ chosen so that the assumptions of
Theorem 3.1 hold, then

$$\inf_{P \in \mathcal{P}} P\left(\Theta_0(P) \subseteq C_n(\hat c_n)\right) \overset{n \to \infty}{\longrightarrow} 1.$$

Rates of convergence to the identified set will depend on the shape of the conditional
mean and the distribution of $X_i$. Note, however, that the set estimate based on the standard
deviation weighted KS statistic can be calculated in the same manner regardless of these
aspects of the data, so the researcher does not have to impose any restrictions on the shapes
of these objects when performing inference. In this sense, inference based on these statistics
adapts to the shapes of the conditional means of $W_i^H$ and $W_i^L$ and the distribution of $X_i$.
In what follows, I consider several alternative assumptions. These include different types of
set identification at the boundary, as well as set identification on a positive probability set.

In the following assumptions, $[\underline{\gamma}, \overline{\gamma}]$ is the identified set for $\gamma$, so that it is implicitly
assumed that $E_P(W_i^H | X_i) \ge \overline{\gamma}$ and $E_P(W_i^L | X_i) \le \underline{\gamma}$ with probability one. Here, $\underline{\gamma}$ and $\overline{\gamma}$
could be equal, leading to point identification. This will be the case when the probability of
selection into the sample conditional on $X_i = x$ converges to one as $x$ approaches some point
on the support of $X_i$. These assumptions are stated so that the same type of identification
holds for the upper and lower support of the identified set, but the same results will hold (with
possibly different rates of convergence to the upper and lower support points) if different
types of identification hold for the upper and lower support. When these assumptions are
invoked for a class of probability distributions $\mathcal{P}$, the constants $C$, $K_X$, and $\eta_X$ are assumed
not to depend on $P$.



Assumption 5.9 (Set Identification at Infinity with Polynomial Tails). $d_X = 1$ and, for
some positive constants $K_X$ and $C$, we have, for all $x \ge K_X$, (i) $E_P(W_i^H | X_i = x) - \overline{\gamma} \le C x^{-\phi_m}$
and (ii) $X_i$ has a density $f_X(x)$ such that $f_X(x) \ge x^{-\phi_x}/C$ for some $\phi_m > 0$ and
$\phi_x > 1$. In addition, part (i) holds with $W_i^H - \overline{\gamma}$ replaced by $\underline{\gamma} - W_i^L$.

Assumption 5.10 (Set Identification at Finite Support with Polynomial Tails). For some
$x_0 \in \mathbb{R}^{d_X}$ and $\eta_X > 0$, we have, for $x_0 - \eta_X \iota \le x \le x_0$ (where $\iota$ is a vector of ones and $\le$
is elementwise if $d_X > 1$) (i) $E_P(W_i^H | X_i = x) - \overline{\gamma} \le C \|x_0 - x\|^{\phi_m}$ and (ii) $X_i$ has a density
$f_X(x)$ such that $f_X(x) \ge \prod_{k=1}^{d_X} |x_{0,k} - x_k|^{\phi_x} / C$ for some $\phi_m > 0$ and some $\phi_x > -1$. In
addition, parts (i) and (ii) hold with $W_i^H - \overline{\gamma}$ replaced by $\underline{\gamma} - W_i^L$ for some possibly different
$x_0$.

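Assumption 5.11 (Set Identification on a Positive Probability Set). For some interval
$[\underline{x}, \overline{x}]$, $E_P(W_i^H | X_i) - \overline{\gamma} = 0$ $P$-a.s. for all $P$ and $P(\underline{x} \le X_i \le \overline{x})$ is bounded away from
zero uniformly in $P$. In addition, the same assumption holds with $W_i^H - \overline{\gamma}$ replaced
by $\underline{\gamma} - W_i^L$ for some possibly different interval $[\underline{x}, \overline{x}]$.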

All cases of Assumptions 5.9 and 5.10 can be transformed into Assumption 5.10 with
$\phi_x = 0$ and some $\phi_m$ by monotonic transformations of each element of $X_i$. The case where
Assumption 5.10 holds with $\phi_x = 0$ fits into the framework of Theorem 4.3, so this can be
applied to the transformed model.

Theorem 5.6. Let $\mathcal{P}$ be any class of probability measures on the random variables in the
selection model described above such that $W_i^H$ and $W_i^L$ are bounded uniformly over $P \in \mathcal{P}$.
Suppose that the class of functions $\mathcal{G}$, the function $S$, and the sequences $\sigma_n$ and $\hat c_n$ are chosen
so that Assumptions 3.1, 3.2, 3.4 and 3.5 hold with $\sigma_n$ and $\hat c_n$ chosen so that the assumptions
of Theorem 3.1 hold, and Assumption 4.5 holds.

If, in addition to these conditions, one of Assumptions 5.9 or 5.10 holds, then, for some
$B$,

$$\sup_{P \in \mathcal{P}} P\left( \left(\frac{n}{\hat c_n^2 \log n}\right)^{\alpha/(d_X + 2\alpha)} d_H(C_n(\hat c_n), \Theta_0(P)) > B \right) \overset{n \to \infty}{\longrightarrow} 0,$$

where $\alpha = \phi_m/(\phi_x + 1)$ if Assumption 5.10 holds and $\alpha = \phi_m/(\phi_x - 1)$ (and $d_X = 1$) if
Assumption 5.9 holds. If Assumption 5.11 holds, then, for some $B$,

$$\sup_{P \in \mathcal{P}} P\left( \left(\frac{n}{\hat c_n^2 \log n}\right)^{1/2} d_H(C_n(\hat c_n), \Theta_0(P)) > B \right) \overset{n \to \infty}{\longrightarrow} 0.$$

The rate of convergence in Theorem 5.6 shows that, for a given selection process conditional
on $X_i$, the rate of convergence will be faster when $X_i$ has more mass near the point $x_0$
or region $[\underline{x}, \overline{x}]$ where the conditional moment inequalities give the most identifying
information. The rate of convergence is fastest ($((\log n)/n)^{1/2}$) under Assumption 5.11, when this
region has a positive probability. Under identification at a finite point (Assumption 5.10),
the rate of convergence depends on whether the density of $X_i$ approaches infinity, zero, or a
finite nonzero value. If $-1 < \phi_x < 0$, the density will approach infinity at a rate that is faster
when $\phi_x$ is closest to $-1$ ($\phi_x$ must be strictly greater than $-1$ in order for the density to
integrate to a finite number). For $\phi_x = 0$, the density approaches a finite nonzero value, and,
for $\phi_x > 0$, the density approaches zero at a rate that is faster for larger values of $\phi_x$. The
rate of convergence under Assumption 5.10 will always be slower than $((\log n)/n)^{1/2}$, but it
will be arbitrarily close to this rate when $\phi_x$ is close to $-1$ (when the density approaches
infinity at close to the fastest possible rate). Under identification at infinity (Assumption
5.9), the rate of convergence will be faster for thicker tails (smaller $\phi_x$), and will be close to
$((\log n)/n)^{1/2}$ for $\phi_x$ close to 1 (in this case, $\phi_x$ must be greater than one in order for the
density to integrate to a finite number).
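
The dependence of the rate on the tail parameters can be made concrete with a small calculation. The sketch below takes the formulas for $\alpha$ from Theorem 5.6 and the rate exponent $\alpha/(d_X + 2\alpha)$ as given; the function names are hypothetical, and log factors and constants are ignored.

```python
def holder_alpha(phi_m, phi_x, at_infinity):
    """Return alpha implied by the tail parameters in Theorem 5.6.

    at_infinity=True corresponds to Assumption 5.9 (requires phi_x > 1);
    at_infinity=False corresponds to Assumption 5.10 (requires phi_x > -1).
    """
    if phi_m <= 0:
        raise ValueError("phi_m must be positive")
    if at_infinity:
        if phi_x <= 1:
            raise ValueError("Assumption 5.9 requires phi_x > 1")
        return phi_m / (phi_x - 1)
    if phi_x <= -1:
        raise ValueError("Assumption 5.10 requires phi_x > -1")
    return phi_m / (phi_x + 1)


def rate_exponent(alpha, d_x=1):
    """Exponent alpha / (d_X + 2 alpha) of the convergence rate."""
    return alpha / (d_x + 2 * alpha)


# Finite support point: the exponent approaches 1/2 as phi_x decreases to -1.
for phi_x in (2.0, 0.0, -0.9, -0.999):
    alpha = holder_alpha(1.0, phi_x, at_infinity=False)
    print(phi_x, alpha, rate_exponent(alpha))
```

The exponent stays strictly below $1/2$ for every admissible $\phi_x$, matching the statement that the rate under Assumption 5.10 is always slower than the rate under Assumption 5.11.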



6 Rates of Convergence for Other Estimators 



In order to compare the estimators based on KS statistics with increasing variance weights
proposed in this paper to estimation procedures based on kernels or KS statistics with
bounded weights, we need rates of convergence for these estimators as well. Since these
results are not available in the literature (with the exception of the results of Andrews and Shi
(2009) and Kim (2008) for the local power of KS statistics with bounded weights, which
apply to the model in Section 5.5 under the positive probability set identification condition,
Assumption 5.11, but not the other models or conditions in this paper), I derive these results
in this section.

Under upper bounds on the smoothness of the data generating process that correspond
to the lower bounds in Assumptions 4.3 and 4.4, I show that estimators based on
KS statistics with bounded weight functions converge at a $n^{\alpha/(2d_X + 2\alpha)}$ rate, slower than
the $(n/\log n)^{\alpha/(d_X + 2\alpha)}$ rate of convergence derived in Section 4 for the estimator based on
the truncated variance weighting with the sequence of truncation points increasing quickly
enough. Kim (2008) shows that the rate of convergence of a similar estimator will be $n^{1/2}$
under conditions similar to Assumption 5.11, in which local alternatives violate the conditional
moment inequality on a positive probability set. In these situations, the increasing
sequence of weights for the KS statistic proposed in this paper will lead to a $(n/\log n)^{1/2}$ rate
of convergence for the set estimate. For estimators of the identified set based on kernel
estimates of the conditional mean, if the sequence of bandwidth parameters is chosen properly, I
show that the set estimate will converge at the same $(n/\log n)^{\alpha/(d_X + 2\alpha)}$ rate as the variance
weighted KS statistic based estimates, but the rate of convergence can be much slower if
the bandwidth is chosen suboptimally. However, with the optimal sequence of bandwidths,
power against local alternatives that approach the identified set at this rate will likely be
greater for kernel based estimates. Thus, the results in this section show that the weighted
KS statistic based estimates proposed in this paper do almost as well as an infeasible
procedure that uses prior knowledge of the data generating process to choose the best from a set
of other estimators.

While the results in this section show that the truncated variance weighting allows KS 
statistic based estimates to adapt to a broad class of smoothness conditions, these statistics 
will not achieve the optimal rate of convergence when more than two derivatives are imposed 
on the conditional mean (although the results in Section 5.5 show that KS statistics with the
weighting in this paper also adapt to a broad class of tail behavior in cases of set identification 
at the boundary). The reason is that the KS statistics considered in this paper integrate the 
conditional mean against nonnegative functions, which prevents them from taking advantage 
of higher order smoothness conditions. Estimation methods based on higher order kernels 
or sieves would likely perform better in some of these situations, although some of these 
methods would fail to control the size of these tests when these smoothness conditions fail. 



6.1 Bounded Weight Functions 

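Consider a set estimate based on a KS statistic similar to the ones considered so far, but with
the weight function $1/(\sigma(\theta, g) \vee \sigma_n)$ replaced by some bounded weight function $\omega_n(\theta, g) =
(\omega_{n,1}(\theta, g), \ldots, \omega_{n,d_Y}(\theta, g))$. Here, $\omega_n(\theta, g)$ is unrestricted, except for the requirement that,
for some $\overline{\omega}$, we have $\|\omega_n(\theta, g)\| \le \overline{\omega}$ for all $n$, $\theta$, and $g$. Define $T_{n,\omega}(\theta)$ to be the
corresponding KS statistic, with the supremum taken over $g \in \mathcal{G}$.

Following Andrews and Shi (2009) (with additional conditions to control the complexity of
$m_j(W_i, \theta) g_j(X_i)$ over $\theta$ as well as $g$), $T_{n,\omega}(\theta)$ will converge at a $\sqrt{n}$ rate, so define the estimate
of the identified set for critical value $\hat c_n$ to be

$$C_{n,\omega}(\hat c_n) = \left\{ \theta \in \Theta \,\middle|\, \sqrt{n}\, T_{n,\omega}(\theta) \le \hat c_n \right\}.$$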

Under upper bounds on the smoothness of the conditional mean that correspond to the
lower bounds given in Section 4, upper bounds on the rate of convergence of set estimates
based on KS statistics with bounded weights can be derived. These conditions are stated in
the following assumption.

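Assumption 6.1. For some $\theta_0 \in \delta\Theta_0(P)$ such that $\theta_0$ is in the interior of $\Theta$, the following
holds for some neighborhood $B(\theta_0)$ of $\theta_0$. (i) $\bar m(\theta, x, P)$ is differentiable in $\theta$ with derivative
$\bar m_\theta(\theta, x, P)$ bounded over $\theta \in B(\theta_0)$. (ii) For some $\eta > 0$, we have, for all $\theta_0' \in (\delta\Theta_0(P)) \cap
B(\theta_0)$, the set $\mathcal{X}_0(\theta_0')$ of points $x_0$ such that $\min_k \bar m_k(\theta_0', x_0, P) = 0$ satisfies

$$|\bar m_j(\theta_0', x, P) - \bar m_j(\theta_0', x_0, P)| \ge \eta \left( \|x - x_0\|^{\alpha} \wedge \eta \right),$$

for all $j$, and the number of elements in $\mathcal{X}_0(\theta_0')$ is bounded uniformly over $\theta_0'$. (iii) $X_i$ has
finite support and a bounded density on its support. (iv) There exists a path $t \mapsto \theta_t$ such that
$\theta_t \to \theta_0$ as $t \to 0$ and $t \mapsto d(\theta_t, \theta_0)$ is continuous for $t$ in a neighborhood of 0.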

Assumption 6.1 gives an upper bound on the smoothness of the conditional mean similar
to the lower bound of Assumption 4.4. It states that $\alpha$ is the best (greatest) possible value of
$\alpha$ for which Assumption 4.4 can hold. Without this assumption, rates of convergence derived
using Assumption 4.4 and some value of $\alpha$ could be conservative, since the same assumption
could also hold with a larger value of $\alpha$. The next theorem uses this condition to get an
upper bound on the rate of convergence of the set estimator $C_{n,\omega}(\hat c_n)$ when the sequence of
weight functions is uniformly bounded.



Theorem 6.1. Under Assumptions 3.1, 3.2, 3.3, 3.4 and 6.1, if $\hat c_n$ is bounded away from
zero and $g(X_i)$ and $m(W_i, \theta)$ are uniformly bounded, then, for some $\varepsilon > 0$,

$$P\left( n^{\alpha/(2d_X + 2\alpha)}\, d_H(C_{n,\omega}(\hat c_n), \Theta_0(P)) > \varepsilon \right) \overset{n \to \infty}{\longrightarrow} 1.$$

Under the smoothness conditions of Section 4, this slower rate of convergence can be
achieved (up to an arbitrarily slow rate of growth of the critical value) using bounded weights
with an estimated set that contains $\Theta_0(P)$ with probability approaching one.



Theorem 6.2. Suppose that Assumptions EJl El [23 Q gJ| g]^ O and^^ hold. 



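Let the weight function $\omega_n(\theta, g)$ satisfy $\underline{\omega} \le \omega_n(\theta, g) \le \overline{\omega}$ for some $0 < \underline{\omega} < \overline{\omega} < \infty$, and
suppose that $\hat c_n \to \infty$ with $\hat c_n/\sqrt{n} \to 0$. Then

$$\inf_{P \in \mathcal{P}} P\left( \Theta_0(P) \subseteq C_{n,\omega}(\hat c_n) \right) \overset{n \to \infty}{\longrightarrow} 1$$

and, for $B$ large enough,

$$\sup_{P \in \mathcal{P}} P\left( (n/\hat c_n^2)^{\alpha/(2d_X + 2\alpha)}\, d_H(C_{n,\omega}(\hat c_n), \Theta_0(P)) > B \right) \overset{n \to \infty}{\longrightarrow} 0.$$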



The $n^{\alpha/(2d_X + 2\alpha)}$ rate of convergence for the estimator using bounded weights is slower
than the $(n/\log n)^{\alpha/(d_X + 2\alpha)}$ rate of convergence derived in Section 4 for the estimator using
the truncated variance weights. The rate of convergence is slower because sequences of
local alternatives violate a shrinking set of moment inequalities. This leads to sequences
of functions in $\mathcal{G}$ with the most power having a shrinking sequence of variances, so that
a bounded weighting function cannot give them enough weight. While the examples in
Section 5 show that this case is likely to be common in practice, bounded weight functions
will have advantages in other cases. Under conditions such as Assumption 5.11 for the
selection model in Section 5.5, sequences of local alternatives lead to a single function in
$\mathcal{G}$ with positive variance having power. In this case, using a bounded sequence of weight
functions does not cause such a problem, and the increasing sequence of truncation points
does worse by a power of $\log n$ because of the larger critical value needed for the KS statistic.
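
To see why shrinking variances matter, the following sketch evaluates the truncated inverse standard deviation weight $1/(\sigma \vee \sigma_n)$ on indicator functions of shrinking intervals. It is a simplified stand-in, not the statistic used in the paper: the standard deviation of the indicator $I(s < X_i < t)$ for a uniform $X_i$ is used in place of $\sigma(\theta, g)$, and the support length and the value of $\sigma_n$ are hypothetical.

```python
import math


def truncated_inverse_sd_weight(sigma, sigma_n):
    """Weight 1 / max(sigma, sigma_n): inverse standard deviation, truncated at 1 / sigma_n."""
    return 1.0 / max(sigma, sigma_n)


def interval_indicator_sd(width, support_length=6.0):
    """Standard deviation of I(s < X < t) when X is uniform on an interval of the given length."""
    p = width / support_length
    return math.sqrt(p * (1.0 - p))


sigma_n = 0.1  # hypothetical truncation point
for width in (3.0, 0.3, 0.03):
    sd = interval_indicator_sd(width)
    print(width, round(sd, 4), round(truncated_inverse_sd_weight(sd, sigma_n), 4))
```

The weight grows as the interval shrinks until the truncation point binds, whereas a bounded weight function would stay below a fixed constant for every interval.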






6.2 Kernel Methods 



Suppose that we estimate the conditional mean $E_P(m_j(W_i, \theta) | X_i = x) = \bar m_j(\theta, x, P)$ using
the kernel estimate

$$\hat m_j(\theta, x) = \frac{E_n m_j(W_i, \theta)\, k((X_i - x)/h_n)}{E_n k((X_i - x)/h_n)}$$

for some sequence $h_n \to 0$. Chernozhukov, Lee, and Rosen (2009) and Ponomareva (2010)
propose methods for inference on conditional moment inequalities based on this estimate of
the conditional mean. Following Chernozhukov, Lee, and Rosen (2009), this estimate of the
conditional mean will converge at a $\sqrt{n h_n^{d_X}/\log n}$ rate uniformly over $x$. Using the results
in this paper, this rate can be shown to be uniform over $\theta$ as well, so that the statistic

$$T_{n,k}(\theta) = \sup_{x \in \operatorname{supp}_P(X_i)} S(\hat m(\theta, x))$$

can be used to form an estimate

$$C_{n,k}(\hat c_n) = \left\{ \theta \in \Theta \,\middle|\, \sqrt{n h_n^{d_X}/\log n}\; T_{n,k}(\theta) \le \hat c_n \right\}$$

that will contain the identified set with probability approaching one for $\hat c_n$ large enough.
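
A minimal sketch of the kernel estimate of the conditional mean, for a scalar conditioning variable, is given below. The Epanechnikov kernel is an assumption made for this example only (it is nonnegative, bounded, integrates to one, and is bounded away from zero near zero); the function names are hypothetical.

```python
def epanechnikov(t):
    """A nonnegative, first order kernel supported on [-1, 1]."""
    return 0.75 * (1.0 - t * t) if abs(t) <= 1.0 else 0.0


def kernel_conditional_mean(x, xs, ms, h, kernel=epanechnikov):
    """Kernel estimate of E[m | X = x] from observations (xs[i], ms[i]).

    This is the ratio of sample means of m_i * k((X_i - x) / h) and
    k((X_i - x) / h); the 1/n factors cancel, so plain sums are used.
    """
    weights = [kernel((xi - x) / h) for xi in xs]
    denom = sum(weights)
    if denom == 0.0:
        raise ValueError("no observations within the bandwidth of x")
    return sum(w * m for w, m in zip(weights, ms)) / denom


# Two observations equally distant from x = 0.5 receive equal weight.
print(kernel_conditional_mean(0.5, [0.0, 1.0], [0.0, 1.0], h=1.0))
```

Because the kernel is nonnegative, the estimate is always a weighted average of the observed values, which is the feature that rules out gains from higher order smoothness.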

I place the following conditions on the choice of kernel function k. All of these conditions 
are fairly mild regularity conditions, except for the requirement that k be positive, which 
rules out higher order kernels. Ruling out higher order kernels is important. Since the class of 
KS statistics used in this paper integrate the conditional moment inequality against positive 
functions, these statistics cannot take advantage of smoothness conditions of more than two 
derivatives, while higher order kernels with a properly chosen bandwidth can. 

Assumption 6.2. (i) $k$ is nonnegative. (ii) $k$ integrates to one, is bounded and square
integrable over $\mathbb{R}^{d_X}$, and $k(t)$ is bounded away from zero for $t$ in some neighborhood of 0. (iii)
Assumption 3.2 holds with $\mathcal{G}$ replaced by the class of functions $t \mapsto k((t - x)/h)$ where $x$ and
$h$ vary.

As with set estimators based on KS statistics with bounded weights, the upper bounds
on the smoothness of the conditional mean in Assumption 6.1 lead to upper bounds on the
rate of convergence of estimates of the identified set based on kernel estimates. For the first
order kernel estimates described above, estimates of the identified set will converge no faster
than estimates based on variance weighted KS statistics, and will only achieve the same rate
if the tuning parameter is chosen to go to zero at the proper rate. Although this means
that properly weighted KS statistics will generally do at least as well as first order kernel
estimates and sometimes better in terms of rates of convergence, kernel estimates with a
properly chosen sequence $h_n$ may do better against alternatives that approach the identified
set at a given rate.

The upper bound on rates of convergence for kernel based estimators is stated in the
following theorem. In this theorem, the requirements that the critical value $\hat c_n$ be large
and that the bandwidth $h_n$ not shrink too quickly ensure that the procedure controls the
probability of false rejection. If these conditions do not hold, we may have $\Theta_0(P) \not\subseteq C_{n,k}(\hat c_n)$
with high probability asymptotically.

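Theorem 6.3. Suppose that Assumptions 4.5, 6.1 and 6.2 hold. If $\hat c_n$ is chosen large enough,
and if $h_n^{d_X} n/\log n \ge a$ for $a$ large enough, then, for some $\varepsilon > 0$,

$$P\left( \left( \sqrt{\frac{n h_n^{d_X}}{\log n}} \wedge h_n^{-\alpha} \right) d_H(C_{n,k}(\hat c_n), \Theta_0(P)) > \varepsilon \right) \overset{n \to \infty}{\longrightarrow} 1.$$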



The upper bound on the rate of convergence in Theorem 6.3 is the slower of
$\sqrt{n h_n^{d_X}/\log n}$, which comes from a variance term, and $h_n^{-\alpha}$, which comes from a bias term. The optimal
rate of convergence for estimates based on first order kernels will be achieved only when
$h_n = O\left( ((\log n)/n)^{1/(d_X + 2\alpha)} \right)$. Thus, choosing the optimal $h_n$ requires knowing or estimating the
Hölder constant $\alpha$. While kernel based estimates may give more power when $h_n$ is chosen
optimally, variance weighted KS statistics give the same rate of convergence as kernel based
estimates with the optimally chosen $h_n$ without knowing $\alpha$. If $h_n$ is chosen to go to zero
at a different rate from the optimal rate for a given data generating process, kernel based
estimates of the identified set will converge more slowly than estimates based on variance
weighted KS statistics. If the choice of $h_n$ is far enough off from the optimal choice (i.e. if
the researcher is wrong enough about the smoothness of the data generating process), even
the rate of convergence for unweighted KS statistics in Theorem 6.2 will be better than the
rate of convergence of the kernel based estimate.
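
The trade-off between the variance and bias terms can be illustrated by comparing polynomial rate exponents. The sketch below assumes a bandwidth of the form $h_n = n^{-b}$ and ignores log factors and constants, so it is only a rough guide; the function names are hypothetical.

```python
def kernel_rate_exponent(b, alpha, d_x=1):
    """Exponent of the kernel based rate when h_n = n**(-b), ignoring log factors.

    The variance term sqrt(n * h_n**d_x) grows like n**((1 - b * d_x) / 2) and
    the bias term h_n**(-alpha) grows like n**(b * alpha); the slower one binds.
    """
    if not 0.0 < b * d_x < 1.0:
        raise ValueError("need 0 < b * d_x < 1 so that n * h_n**d_x grows")
    return min((1.0 - b * d_x) / 2.0, b * alpha)


def optimal_bandwidth_exponent(alpha, d_x=1):
    """Value of b that balances the variance and bias terms."""
    return 1.0 / (d_x + 2.0 * alpha)


def bounded_ks_exponent(alpha, d_x=1):
    """Exponent alpha / (2 d_X + 2 alpha) for KS statistics with bounded weights."""
    return alpha / (2.0 * d_x + 2.0 * alpha)


alpha, d_x = 2.0, 1
b_star = optimal_bandwidth_exponent(alpha, d_x)
for b in (0.05, b_star, 0.6):
    print(b, kernel_rate_exponent(b, alpha, d_x), bounded_ks_exponent(alpha, d_x))
```

With $\alpha = 2$ and $d_X = 1$, the balanced choice attains the exponent $\alpha/(d_X + 2\alpha)$, while bandwidths that shrink much too slowly or much too quickly fall below the exponent for bounded weights.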
7 Monte Carlo



To examine the finite sample properties of the set estimates proposed in this paper, and
to illustrate their implementation, I perform a monte carlo study. I apply the weighted
KS statistic based set estimates to a quantile regression model with missing data on the
outcome variable, where no additional assumptions are imposed on the process generating
the missing values. Letting $W_i^*$ be the true value of the outcome variable, I simulate from
a model where the median of $W_i^*$ given $X_i = x$ is given by $\theta_1 + \theta_2 x$, but $W_i^*$ is not always
observed. This falls into the framework of the interval quantile regression model described
in Section 5.4, with $W_i^H = W_i^L = W_i^*$ when the outcome variable is observed, and $W_i^H = \infty$
and $W_i^L = -\infty$ when the outcome variable is unobserved. The identified set contains all
values of $(\theta_1, \theta_2)$ that are consistent with the median regression model and some, possibly
endogenous, censoring mechanism generating the missing values.

I generate data as follows. For $X_i$ and $U_i^*$ generated as independent variables with
$X_i \sim \text{unif}(-3, 3)$ and $U_i^* \sim \text{unif}(-1, 1)$ and $(\theta_{1,*}, \theta_{2,*}) = (1/4, 1/2)$, I set $W_i^* = \theta_{1,*} +
\theta_{2,*} X_i + U_i^*$. Then, I set $W_i^*$ to be missing (that is, $(W_i^L, W_i^H) = (-\infty, \infty)$) with probability
$1/5 - X_i^2/20 + X_i^4/200$, and observed ($W_i^L = W_i^H = W_i^*$) with the remaining probability
$1 - (1/5 - X_i^2/20 + X_i^4/200)$. Note that, while the data are generated by taking a particular
point $(\theta_{1,*}, \theta_{2,*})$ in the identified set and using a censoring process that satisfies the missing
at random assumption (that the event of $W_i^*$ not being observed is independent of $U_i^*$
conditional on $X_i$), the identified set for this model is larger than a single point, and contains
all values of $(\theta_1, \theta_2)$ that are consistent with median regression and any form of censoring,
including those where the probability of not observing $W_i^*$ depends on the outcome $W_i^*$ itself.
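
The data generating process can be sketched as follows. This is an illustration rather than the code used for the study: the function names and the seed are hypothetical, and the missingness probability is read as $1/5 - X_i^2/20 + X_i^4/200$, which stays between 0 and 1 on the support $[-3, 3]$.

```python
import math
import random


def missing_probability(x):
    """Probability that the outcome is missing given X_i = x (assumed reading of the design)."""
    return 1 / 5 - x ** 2 / 20 + x ** 4 / 200


def simulate_missing_outcome_data(n, seed=0, theta=(0.25, 0.5)):
    """Draw n observations (X_i, W_i^L, W_i^H) from the monte carlo design.

    Missing outcomes are encoded as the interval (-inf, inf); observed outcomes
    have W_i^L = W_i^H = W_i^*.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-3.0, 3.0)
        u = rng.uniform(-1.0, 1.0)
        w_star = theta[0] + theta[1] * x + u
        if rng.random() < missing_probability(x):
            w_lo, w_hi = -math.inf, math.inf
        else:
            w_lo = w_hi = w_star
        data.append((x, w_lo, w_hi))
    return data


sample = simulate_missing_outcome_data(500, seed=0)
share_missing = sum(1 for _, w_lo, _ in sample if w_lo == -math.inf) / len(sample)
print(len(sample), share_missing)
```

Here missingness depends only on $X_i$, so the simulated censoring is missing at random even though the estimation procedure does not assume this.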

Figure 1 shows the true conditional medians $q_{1/2,P}(W_i^H | X_i = x)$ and $q_{1/2,P}(W_i^L | X_i = x)$
as a function of $x$ for this example. The true identified set $\Theta_0(P)$ for this example is the
set of parameter values $(\theta_1, \theta_2)$ such that the line $\theta_1 + \theta_2 x$ is between these two conditional
medians for every value of $x$ on the support of $X_i$. Figure 2 plots the boundary of this
identified set. The identified set consists of all points outlined by the shape in this figure.

To illustrate the implementation of the set estimates in this paper applied to this model,
I present a contour plot of the KS statistic evaluated at different values of the parameters for
a single data set drawn from this data generating process. For a given choice of the critical
value $\hat c_n$, the set estimate $C_n(\hat c_n)$ is then given by the set of points $(\theta_1, \theta_2)$ such that the KS
statistic $T_n(\theta)$ given in this plot is less than or equal to $\hat c_n \sqrt{(\log n)/n}$. In other words, each
of the level sets in this plot gives the boundary of $C_n(\hat c_n)$ for some choice of $\hat c_n$, with the level
sets for larger values of the KS statistic corresponding to larger (more conservative) choices
of $\hat c_n$.

Figure 1: Conditional Medians of $W_i^H$ and $W_i^L$ for Quantile Regression Model

Figure 2: Identified Set for Quantile Regression Model



The contour plot is shown in Figure 3. This plot was formed from a single data set
of $n = 500$ observations drawn from the data generating process described above. For the
set of functions $\mathcal{G}$, I used the set of indicator functions for intervals $I(s < X_i < t)$. For the
truncation point $\sigma_n$ for the standard deviation weights, I multiply 1/2, the standard deviation
of a single Bernoulli(1/2) variable, by $\sqrt{(\log n)(\log \log n)/n}$, a sequence that converges to
zero more slowly than $\sqrt{\log n / n}$ as required. I set $S(t_1, t_2) = \max(t_1, t_2, 0)$.

Figure 3: Contours of KS Statistic for Quantile Regression Model

For the monte carlo, I use the same choices of $\mathcal{G}$, $\sigma_n$, and $S$. For the critical value $\hat c_n$, I use
the slowly increasing sequence $2\sqrt{\log \log n}$. Setting the critical value to 2 regardless of $n$ leads
to a somewhat conservative critical value for the case where $\mathcal{G}$ contains a single function.
As $n$ increases, functions $I(s < X_i < t)$ with $s$ close to $t$ are given increasing weight, so
that the KS statistic behaves like the maximum of an increasing number of standard normal
variables, and the critical value increases appropriately (or slightly faster than needed). I
generate monte carlo data sets with the data generating process described above and $n$ equal
to 200, 500, and 1000 observations. I use 1000 replications for each monte carlo design.
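
The tuning sequences described above are simple to compute. The sketch below evaluates the truncation point $\sigma_n = (1/2)\sqrt{(\log n)(\log \log n)/n}$, the critical value $\hat c_n = 2\sqrt{\log \log n}$, and the resulting cutoff $\hat c_n \sqrt{(\log n)/n}$ for the KS statistic at the three sample sizes; the function names are hypothetical.

```python
import math


def truncation_point(n):
    """sigma_n = (1/2) * sqrt(log(n) * log(log(n)) / n); requires n >= 3."""
    if n < 3:
        raise ValueError("n must be at least 3 so that log(log(n)) is positive")
    return 0.5 * math.sqrt(math.log(n) * math.log(math.log(n)) / n)


def critical_value(n):
    """c_n = 2 * sqrt(log(log(n))); requires n >= 3."""
    if n < 3:
        raise ValueError("n must be at least 3 so that log(log(n)) is positive")
    return 2.0 * math.sqrt(math.log(math.log(n)))


def ks_cutoff(n):
    """Values of theta with KS statistic at most c_n * sqrt(log(n) / n) enter the set estimate."""
    return critical_value(n) * math.sqrt(math.log(n) / n)


for n in (200, 500, 1000):
    print(n, truncation_point(n), critical_value(n), ks_cutoff(n))
```

The ratio of $\sigma_n$ to $\sqrt{(\log n)/n}$ equals $(1/2)\sqrt{\log \log n}$, which grows without bound, so the truncation point does converge to zero more slowly than $\sqrt{(\log n)/n}$.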

                     quantiles
    n      .25    .5     .75    .9     .95    coverage
    200    0.45   0.5    0.54   0.59   0.62   100%
    500    0.34   0.36   0.39   0.41   0.43   100%
    1000   0.27   0.28   0.3    0.32   0.33   100%

Table 1: Summary Statistics for Hausdorff Distances for Monte Carlo

Results are reported in Tables 1, 2 and 3. In Table 1, I report, for each sample size, the
coverage probability (the proportion of the monte carlo replications for which the estimate
contains the identified set) and quantiles of the Hausdorff distance between the estimate
and the identified set. Even for the smallest sample size of $n = 200$, the set estimate contains
the identified set for every monte carlo replication. For the data generating process
                            quantiles
             n      .25     .5      .75     .9      .95
  θ_1       200     0.45    0.49    0.53    0.57    0.59
            500     0.34    0.36    0.38    0.41    0.43
           1000     0.26    0.28    0.3     0.32    0.33
  θ_2       200     0.33    0.36    0.41    0.48    0.52
            500     0.24    0.25    0.27    0.28    0.29
           1000     0.19    0.2     0.21    0.22    0.22

Table 2: Summary Statistics for Hausdorff Distances for Individual Parameters for Monte
Carlo



                    quantiles of $\ell_{\theta_i}$                 quantiles of $u_{\theta_i}$
             n      .05     .1      .25     .5      .75       .25     .5      .75     .9      .95
  θ_1       200    -0.47   -0.45   -0.4    -0.35   -0.31      0.81    0.86    0.9     0.95    0.98
            500    -0.31   -0.3    -0.26   -0.23   -0.2       0.71    0.74    0.77    0.79    0.81
           1000    -0.22   -0.2    -0.18   -0.16   -0.14      0.64    0.67    0.69    0.71    0.72
  θ_2       200    -0.05    0       0.05    0.1     0.14      0.87    0.9     0.95    1.01    1.06
            500     0.16    0.17    0.18    0.2     0.22      0.78    0.8     0.82    0.83    0.84
           1000     0.22    0.23    0.24    0.25    0.27      0.73    0.75    0.76    0.77    0.78

Table 3: Summary Statistics for Set Estimates for Individual Parameters for Monte Carlo

used in these monte carlos, Theorem 4.3 holds with $\alpha = 2$, giving a rate of convergence of
$((c_n^2 \log n)/n)^{\alpha/(d_X + 2\alpha)} = ((c_n^2 \log n)/n)^{2/5}$ (this follows from Theorem 5.4, since the upper
and lower conditional medians are bounded away from each other and have smooth second
derivatives, and regression lines corresponding to parameters on the boundary of the
identified set are tangent to one of the conditional medians on the interior of the support of $X_i$).
According to this result, the distance of the set estimate to the identified set should decrease
by a factor of .77 going from n = 200 to n = 500, and should decrease again by a factor of
.81 going from n = 500 to n = 1000. These asymptotic results give a decent approximation
of the monte carlo results reported in Table 1, although the Hausdorff distances in this table
decrease slightly more quickly.
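
The two predicted factors can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming the rate $((c_n^2 \log n)/n)^{\alpha/(d_X+2\alpha)}$ with $c_n = 2\sqrt{\log\log n}$, $\alpha = 2$ and $d_X = 1$; the function name `rate` is hypothetical. The comparison values are the medians from Table 1.

```python
import math


def rate(n: int, alpha: float = 2.0, d_x: int = 1) -> float:
    """((c_n^2 log n) / n) ** (alpha / (d_x + 2 alpha)) with c_n = 2 sqrt(log log n)."""
    c_n_squared = 4.0 * math.log(math.log(n))
    return (c_n_squared * math.log(n) / n) ** (alpha / (d_x + 2.0 * alpha))


# Predicted shrinkage factors of the Hausdorff distance.
factor_200_to_500 = rate(500) / rate(200)
factor_500_to_1000 = rate(1000) / rate(500)
print(round(factor_200_to_500, 2), round(factor_500_to_1000, 2))

# Observed ratios of the median Hausdorff distances reported in Table 1.
observed_200_to_500 = 0.36 / 0.5
observed_500_to_1000 = 0.28 / 0.36
print(round(observed_200_to_500, 2), round(observed_500_to_1000, 2))
```

The observed ratios of the medians are somewhat smaller than the predicted factors, in line with the remark that the distances in Table 1 decrease slightly more quickly than the asymptotic rate suggests.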

The Hausdorff distance to the identified set summarizes the accuracy of the set estimator,
but can be difficult to interpret, since it combines the accuracy of the estimate across
both coordinates. Tables 2 and 3 report monte carlo results for the projections of the set
estimate $C_n(c_n)$ onto each coordinate. Define $\Theta_{1,\mathrm{proj}}(P) = \{\theta_1 \mid (\theta_1, \theta_2) \in \Theta_0(P) \text{ some } \theta_2\}$ and
$\Theta_{2,\mathrm{proj}}(P) = \{\theta_2 \mid (\theta_1, \theta_2) \in \Theta_0(P) \text{ some } \theta_1\}$ to be the projections of the identified set onto each
coordinate and $C_{n,1,\mathrm{proj}}(c_n) = \{\theta_1 \mid (\theta_1, \theta_2) \in C_n(c_n) \text{ some } \theta_2\}$ and $C_{n,2,\mathrm{proj}}(c_n) = \{\theta_2 \mid (\theta_1, \theta_2) \in
C_n(c_n) \text{ some } \theta_1\}$ to be the corresponding projections of the set estimate. Let $[\ell_{\theta_i}, u_{\theta_i}]$ be the
smallest interval containing $C_{n,i,\mathrm{proj}}(c_n)$ for i = 1, 2. In Table 2, I report quantiles of the
realizations of $d_H(C_{n,1,\mathrm{proj}}(c_n), \Theta_{1,\mathrm{proj}}(P))$ and $d_H(C_{n,2,\mathrm{proj}}(c_n), \Theta_{2,\mathrm{proj}}(P))$ for the monte carlo
replications. In Table 3, I report quantiles of $\ell_{\theta_i}$ and $u_{\theta_i}$.
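
To make the projection summaries concrete, the sketch below computes the smallest interval containing the projection of a finite set of parameter points onto one coordinate, and the Hausdorff distance between two closed intervals. It is only an illustration: the point set is hypothetical, representing the set estimate as a finite grid of points is an assumption, and the function names are not from the paper.

```python
def projection_interval(points, coord):
    """Smallest interval containing the projection of `points` onto `coord`."""
    values = [p[coord] for p in points]
    return (min(values), max(values))


def hausdorff_interval(a, b):
    """Hausdorff distance between closed intervals a = (lo, hi) and b = (lo, hi).

    For closed bounded intervals this is the larger of the two endpoint gaps.
    """
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))


# Hypothetical grid points (theta_1, theta_2) standing in for a set estimate.
estimate_points = [(-0.1, 0.3), (0.2, 0.5), (0.6, 0.7), (0.4, 0.2)]
interval_1 = projection_interval(estimate_points, 0)
interval_2 = projection_interval(estimate_points, 1)
print(interval_1, interval_2)
print(hausdorff_interval((0.0, 1.0), (0.25, 0.5)))
```

Reporting the two intervals separately shows which coordinate drives the overall distance, which a single Hausdorff distance for the joint set cannot reveal.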

Table 2 reveals that the set estimate for the slope parameter $\theta_2$ has less sampling error in
this case than the intercept parameter $\theta_1$. Indeed, the Hausdorff distances in Table 1 appear
to be driven mostly by the intercept parameter. While estimation error in $\theta_2$ is generally
smaller than estimation error for the intercept parameter, the estimate for $\theta_2$ appears to be
shrinking towards the identified set for $\theta_2$ at a similar rate.

Table 3 summarizes the finite sample behavior of the confidence intervals $[\ell_{\theta_i}, u_{\theta_i}]$ for each
$\theta_i$ generated from the set estimate $C_n(c_n)$. While these confidence intervals for individual
coordinates contain less information than the confidence region $C_n(c_n)$ for the identified set,
they are less cumbersome to report and summarize. For comparison, the projection of the
true identified set onto the intercept coordinate $\theta_1$ is $\Theta_{1,\mathrm{proj}}(P) = [.17, .33]$, and the projection
of the true identified set onto the slope coordinate $\theta_2$ is $\Theta_{2,\mathrm{proj}}(P) = [.47, .53]$. Note that the
confidence interval for the slope parameter $\theta_2$ contains only positive values for 90% of the
monte carlo replications even with the smallest sample size of 200 observations. Thus, one
would correctly conclude that $\theta_2$ is positive for an overwhelming proportion of realizations
of the data even with a relatively small sample size, despite the conservative nature of the
estimate $C_n(c_n)$.

8 Conclusion 

This paper proposes estimates of the identified set in conditional moment inequality models 
based on variance weighted KS statistics. I derive rates of convergence of these and other 
set estimators to the identified set under conditions that apply to many models of practical 
interest. In many settings, the rate of convergence of the set estimator I propose is the fastest 
among those available, and, in settings where other estimators are better, the improvement 
in rate of convergence is no more than a factor of log n. While, in most cases, there is some
other estimator that does slightly better, choosing the correct one requires knowledge of 
smoothness and shape conditions on the data generating process, and guessing incorrectly 
about these conditions can lead the researcher to use an estimator with a much slower rate 
of convergence. The advantage of the estimator proposed in this paper is that it performs 
well under a variety of conditions without prior knowledge of which of these conditions hold. 






In settings where local alternatives violate the conditional moment inequalities on a 
shrinking set, the weights I propose for KS statistics give the statistics more power against 
local alternatives than bounded weights. The examples in Section 5 show that this situation
is common in practice. When sequences of local alternatives violate the conditional moment 
inequalities on a fixed, positive probability set, the larger critical values required by the 
increasing sequence of weight functions lead to a loss in power, but only by a factor of 
$(\log n)^{1/2}$. This provides a theoretical justification for variance weighting in this context.
Under certain conditions, weighting the KS statistic objective function by a truncated inverse 
of the estimated variance increases the rate of convergence of the corresponding estimator 
of the identified set. 

A Appendix 

This appendix collects several results not stated in the body of the paper. In Section A.1, I
state and prove uniform convergence results for classes of functions weighted by truncated
standard deviations. These results are used later in the appendix in proving some of the
results stated in the body of the paper. In Section A.2, I provide sufficient conditions for the
rate of convergence to be strictly slower than $\sqrt{n}$. In Section A.3, I provide an example of a
data generating process for an interval regression where low power against local alternatives
when the slope parameter varies leads to a slower rate of convergence to the identified set.
In Section A.4, I state conditions under which Assumption 3.2 holds and verify them for the
applications described in Section 5. Section A.5 contains proofs of the theorems stated in
the body of the paper.

A.1 Uniform Convergence Lemma

The following lemma is useful in deriving some of these results. Applied to mean zero
functions, the lemma says that any sequence of classes of functions that is not too complex
converges uniformly at a $\sqrt{n/\log n}$ rate when scaled by the standard deviation if the
minimum standard deviation does not go to zero too fast.

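Lemma A.1. Let $Z_1, \ldots, Z_n$ be iid observations and let $\mathcal{P}$ be a set of probability distributions
and $\mathcal{F}_{n,P}$ a set of classes of functions indexed by $n \in \mathbb{N}$ and $P \in \mathcal{P}$ such that, for some $\bar f$,
$f(Z_i) \le \bar f$ with P-probability one for $P \in \mathcal{P}$ and $f \in \mathcal{F}_{n,P}$ for each n. Let $\mu_{2,P}(f) =
(E_P f(Z_i)^2)^{1/2}$ and let $\mu_{2,n}$ be a sequence such that $\mu_{2,n}\sqrt{n/\log n}$ is bounded away from zero.
Let $\mathcal{G}_{n,P} = \{f \mu_{2,n}/(\mu_{2,P}(f) \vee \mu_{2,n}) \mid f \in \mathcal{F}_{n,P}\}$ and suppose that

\[
\sup_{P \in \mathcal{P}} \sup_{n \in \mathbb{N}} \sup_{Q} N(\varepsilon, \mathcal{G}_{n,P}, L_1(Q)) \le A \varepsilon^{-W}
\]

for $0 < \varepsilon < 1$ where the supremum over Q is over all probability measures. Then for some
B that does not depend on N,

\[
\sup_{P \in \mathcal{P}} P\left( \frac{\sqrt{n}}{\sqrt{\log n}} \sup_{f \in \mathcal{F}_{n,P}} \left| \frac{(E_n - E_P) f(Z_i)}{\mu_{2,P}(f) \vee \mu_{2,n}} \right| > B \text{ some } n \ge N \right) \stackrel{N \to \infty}{\to} 0.
\]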

Proof. The result follows by applying the following theorem to the classes of functions
$\mathcal{G}_{n,P}$. For $g = f \mu_{2,n}/(\mu_{2,P}(f) \vee \mu_{2,n}) \in \mathcal{G}_{n,P}$,
\[
E_P g(Z_i)^2 = \frac{E_P f(Z_i)^2 \, \mu_{2,n}^2}{\mu_{2,P}(f)^2 \vee \mu_{2,n}^2} = \frac{\mu_{2,P}(f)^2 \mu_{2,n}^2}{\mu_{2,P}(f)^2 \vee \mu_{2,n}^2} \le \mu_{2,n}^2,
\]
so the theorem applies with the same $\mu_{2,n}$. □

Specialized to a class $\mathcal{P}$ of probability distributions with a single element P, this says
that the sequence in the probability statement in the last display of the lemma is bounded
by B with P-probability one. The conclusion of the lemma implies that this scaled sequence
is $O_P(1)$ uniformly in $P \in \mathcal{P}$, but is slightly stronger.

The proof of the lemma uses the following theorem, which is a slightly stronger version
of Theorem 37 in Pollard (1984), with the conditions stated in a slightly different way. The
following theorem basically follows the arguments of the proof of Theorem 37 in Pollard
(1984), but changes a few things to get a slightly stronger result. Note that the notation
$\mu_{2,P}$ is used for the raw second moment of functions rather than their variance, although the
distinction is often not important since applications typically involve the raw second moment
going to zero at the same rate as the variance.

Theorem A.1. Let $Z_1, \ldots, Z_n$ be iid observations and let $\mathcal{P}$ be a set of probability measures
and $\mathcal{F}_{n,P}$ a set of classes of functions indexed by $n \in \mathbb{N}$ and $P \in \mathcal{P}$ such that, for some $\bar f$,
$f(Z_i) \le \bar f$ P-a.s. for $f \in \mathcal{F}_{n,P}$ for $P \in \mathcal{P}$ for each n and, for some positive constants A and
W,

\[
\sup_{P \in \mathcal{P}} \sup_{n \in \mathbb{N}} \sup_{Q} N(\varepsilon, \mathcal{F}_{n,P}, L_1(Q)) \le A \varepsilon^{-W}
\]

for $0 < \varepsilon < 1$ where the supremum over Q is over all probability measures. Suppose that,
for some sequence $\mu_{2,n}$, $E_P f(Z_i)^2 \le \mu_{2,n}^2$ for all $f \in \mathcal{F}_{n,P}$ for all $P \in \mathcal{P}$ for all n. Then, if
$\mu_{2,n}\sqrt{n/\log n}$ is bounded away from zero we will have, for some B that does not depend on
N,

\[
\sup_{P \in \mathcal{P}} P\left( \frac{\sqrt{n}}{\mu_{2,n}\sqrt{\log n}} \sup_{f \in \mathcal{F}_{n,P}} |(E_n - E_P) f(Z_i)| > B \text{ some } n \ge N \right) \stackrel{N \to \infty}{\to} 0.
\]



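Proof. The proof is a slight modification of the proof of Theorem 37 in Pollard (1984). The
sequence $\mu_{2,n}$ corresponds to $\delta_n$ in that theorem, and, in contrast to the theorem from Pollard
(1984), which defines a sequence $\alpha_n$ that must satisfy certain conditions, this theorem
corresponds to using the best sequence possible, and noting that $\alpha_n$ need not be nonincreasing
as long as it is bounded.

Without loss of generality, assume that $\bar f = 1$. Fix B (conditions on how large B has
to be will be stated throughout the proof) and set $\varepsilon_n = (B/8)\mu_{2,n}\sqrt{(\log n)/n}$. Since
$\mathrm{var}_P((E_n - E_P) f(Z_i))/(4\varepsilon_n^2) \le (\mu_{2,n}^2/n)/(4 B^2 \mu_{2,n}^2 (\log n)/(64 n)) = 16/(B^2 \log n) \le 1/2$ for n greater
than some number that does not depend on P, the inequality (30) in Pollard (1984) will
eventually imply

\[
P\left( \frac{\sqrt{n}}{\mu_{2,n}\sqrt{\log n}} \sup_{f \in \mathcal{F}_{n,P}} |(E_n - E_P) f(Z_i)| > B \right)
= P\left( \sup_{f \in \mathcal{F}_{n,P}} |(E_n - E_P) f(Z_i)| > 8\varepsilon_n \right)
\le 4 (P \times \nu)\left( \sup_{f \in \mathcal{F}_{n,P}} |P_n^\circ f(Z_i)| > 2\varepsilon_n \right)
\]

for all $P \in \mathcal{P}$ where $P_n^\circ f(Z_i) = \frac{1}{n}\sum_{i=1}^n s_i f(Z_i)$, and $s_1, \ldots, s_n$ are iid random variables that
take on values ±1 each with probability one half drawn independent of $Z_1, \ldots, Z_n$ and $\nu$
denotes the probability measure of $s_1, \ldots, s_n$. Conditional on the data, this is bounded by

\[
(P \times \nu)\left( \sup_{f \in \mathcal{F}_{n,P}} |P_n^\circ f(Z_i)| > 2\varepsilon_n \,\Big|\, Z_1, \ldots, Z_n \right)
\le 2 N(\varepsilon_n, \mathcal{F}_{n,P}, L_1(P_n)) \exp\left( -\frac{1}{2} \frac{n \varepsilon_n^2}{\sup_{f \in \mathcal{F}_{n,P}} E_n f(Z_i)^2} \right).
\]

For any constant a > 0, on the event that

\[
\sup_{f \in \mathcal{F}_{n,P}} E_n f(Z_i)^2 \le a^2 \mu_{2,n}^2, \qquad (5)
\]

the previous display will be bounded by

\[
2A \exp\left( -\frac{1}{2} \frac{n \varepsilon_n^2}{a^2 \mu_{2,n}^2} - W \log \varepsilon_n \right)
= 2A \exp\left( -\frac{B^2 \log n}{128 a^2} - W \log \frac{B \mu_{2,n} \sqrt{\log n}}{8 \sqrt{n}} \right).
\]

The condition that $\mu_{2,n}\sqrt{n/\log n}$ is bounded away from zero is more than enough to
guarantee that the term in the last logarithm is bounded from below by a fixed power of n. Thus,
the expression in the last display can be made to go to zero at any polynomial rate for any
a by choosing B to be large enough (in a way that depends on a but not n or P).

For any $P \in \mathcal{P}$, the P-probability of (5) failing to hold can be bounded using Lemma 33
in Pollard (1984) with $\delta_n = a\mu_{2,n}/8$ (the lemma holds for $a \ge 8$):

\[
P\left( \sup_{f \in \mathcal{F}_{n,P}} E_n f(Z_i)^2 > a^2 \mu_{2,n}^2 \right)
= P\left( \sup_{f \in \mathcal{F}_{n,P}} E_n f(Z_i)^2 > 64 \delta_n^2 \right)
\le 4 E_P[N(\delta_n, \mathcal{F}_{n,P}, L_2(P_n))] \exp(-n\delta_n^2)
\]
\[
\le 4A(\delta_n/2)^{-W} \exp(-n\delta_n^2) = 4 \cdot 2^W A \exp\left( -n\delta_n^2 - W \log \delta_n \right)
\]
\[
= 4 \cdot 2^W A \exp\left( -n a^2 \mu_{2,n}^2/64 - W \log \frac{a}{8} - W \log \mu_{2,n} \right)
\le 4 \cdot 2^W A \exp\left( -\frac{n a^2 c \log n}{64 n} - W \log \frac{a}{8} - \frac{W}{2} \log \frac{c \log n}{n} \right)
\]

where $\sqrt{c}$ is a lower bound for $\mu_{2,n}\sqrt{n/\log n}$. This can be made to go to zero at any
polynomial rate by choosing a large.

Thus, if we choose a and B large enough, $\sup_{P \in \mathcal{P}} P\left( \frac{\sqrt{n}}{\mu_{2,n}\sqrt{\log n}} \sup_{f \in \mathcal{F}_{n,P}} |(E_n - E_P) f(Z_i)| > B \right)$
will be summable over n, so that

\[
\sup_{P \in \mathcal{P}} P\left( \frac{\sqrt{n}}{\mu_{2,n}\sqrt{\log n}} \sup_{f \in \mathcal{F}_{n,P}} |(E_n - E_P) f(Z_i)| > B \text{ some } n \ge N \right)
\le \sum_{n \ge N} \sup_{P \in \mathcal{P}} P\left( \frac{\sqrt{n}}{\mu_{2,n}\sqrt{\log n}} \sup_{f \in \mathcal{F}_{n,P}} |(E_n - E_P) f(Z_i)| > B \right) \stackrel{N \to \infty}{\to} 0.
\]

□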
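
The variance calculation at the start of the proof can be checked numerically. This is a minimal sketch, assuming the choice $\varepsilon_n = (B/8)\mu_{2,n}\sqrt{(\log n)/n}$, which is the value consistent with the ratio $16/(B^2 \log n)$ that appears in the proof, and bounding the variance of the sample mean by $\mu_{2,n}^2/n$. The function names are hypothetical.

```python
import math


def epsilon_n(n: int, big_b: float, mu_2n: float) -> float:
    """epsilon_n = (B / 8) * mu_{2,n} * sqrt(log n / n) (assumed choice)."""
    return (big_b / 8.0) * mu_2n * math.sqrt(math.log(n) / n)


def variance_ratio(n: int, big_b: float, mu_2n: float) -> float:
    """Upper bound mu_{2,n}^2 / n on the variance, divided by 4 * epsilon_n^2."""
    return (mu_2n ** 2 / n) / (4.0 * epsilon_n(n, big_b, mu_2n) ** 2)


# The ratio does not depend on mu_{2,n} and equals 16 / (B^2 log n).
for n in (100, 1000, 10000):
    print(n, variance_ratio(n, 3.0, 0.2), 16.0 / (9.0 * math.log(n)))

# Smallest n for which the ratio is at most 1/2 when B = 3: n >= exp(32 / B^2).
print(math.ceil(math.exp(32.0 / 9.0)))
```

Since the ratio is free of $\mu_{2,n}$ and of P, the threshold on n after which the symmetrization inequality applies depends only on B, which is what the uniformity over $\mathcal{P}$ requires.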



With this lemma in hand, we can get rates of convergence for classes of functions weighted
by their standard deviation under additional conditions that allow the standard deviation
to be consistently estimated. In order to get results for functions weighted by the standard
deviation rather than the raw second moment, I apply the previous results to classes of
functions of the form $f - E_P f(Z_i)$. Letting $\hat\sigma(f)^2 = E_n(f(Z_i))^2 - (E_n f(Z_i))^2$ and $\sigma_P(f)^2 =
E_P(f(Z_i))^2 - (E_P f(Z_i))^2$, rates of convergence for

\[
\sup_{f \in \mathcal{F}_n} \left| \frac{(E_n - E_P) f(Z_i)}{\hat\sigma(f) \vee \sigma_n} \right|
\]

will follow by applying the above results to the classes of functions $f - E_P f(Z_i)$ once we
can bound $(\sigma_P(f) \vee \sigma_n)/(\hat\sigma(f) \vee \sigma_n)$, and for this it is sufficient to show that $\hat\sigma(f)/\sigma_P(f)$ converges to one
uniformly over $\sigma_P(f) \ge \sigma_n$. The following lemma gives sufficient conditions for this.

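Lemma A.2. Let $Z_1, \ldots, Z_n$ be iid observations and let $\mathcal{F}_n$ be a sequence of classes of
functions and $\mathcal{P}$ a set of probability distributions such that, for some $\bar f$, $f(Z_i) \le \bar f$ with
P-probability one for $P \in \mathcal{P}$ and $f \in \mathcal{F}_n$ for each n. Let $\sigma_P(f) = (E_P f(Z_i)^2 - (E_P f(Z_i))^2)^{1/2}$
and let $\sigma_n$ be a sequence such that $\sigma_n\sqrt{n/\log n}$ is bounded away from zero. Define $\mathcal{G}^1_{n,P} =
\{(f - E_P f(Z_i))\sigma_n/(\sigma_P(f) \vee \sigma_n)\}$ and $\mathcal{G}^2_{n,P} = \{(f - E_P f(Z_i))^2 \sigma_n/(\mu_{2,P}([f - E_P f(Z_i)]^2) \vee \sigma_n)\}$,
and suppose that, for some positive constants A and W,

\[
\sup_{P \in \mathcal{P}} \sup_{n \in \mathbb{N}} \sup_{Q} N(\varepsilon, \mathcal{G}^i_{n,P}, L_1(Q)) \le A \varepsilon^{-W}
\]

for $0 < \varepsilon < 1$ and i = 1, 2, where the supremum over Q is over all probability measures.
Then, for every $\varepsilon > 0$, there exists a c such that, if $\sigma_n\sqrt{n/\log n} \ge c$ for all n,

\[
\sup_{P \in \mathcal{P}} P\left( \sup_{f \in \mathcal{F}_n, \, \sigma_P(f) \ge \sigma_n} \left| \frac{\hat\sigma(f)}{\sigma_P(f)} - 1 \right| > \varepsilon \text{ some } n \ge N \right) \stackrel{N \to \infty}{\to} 0.
\]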



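Proof. We have

\[
\sup_{f \in \mathcal{F}_n, \, \sigma_P(f) \ge \sigma_n} \left| \frac{\hat\sigma(f)}{\sigma_P(f)} - 1 \right|
\le \sup_{f \in \mathcal{F}_n, \, \sigma_P(f) \ge \sigma_n} \left| \frac{\hat\sigma(f)^2}{\sigma_P(f)^2} - 1 \right|
\]
\[
= \sup_{f \in \mathcal{F}_n, \, \sigma_P(f) \ge \sigma_n} \left| \frac{(E_n - E_P)(f(Z_i) - E_P f(Z_i))^2 - (E_n f(Z_i) - E_P f(Z_i))^2}{\sigma_P(f)^2} \right|
\]
\[
\le \sup_{f \in \mathcal{F}_n, \, \sigma_P(f) \ge \sigma_n} \left| \frac{(E_n - E_P)(f(Z_i) - E_P f(Z_i))^2}{\sigma_P(f)^2} \right|
+ \sup_{f \in \mathcal{F}_n, \, \sigma_P(f) \ge \sigma_n} \left| \frac{((E_n - E_P) f(Z_i))^2}{\sigma_P(f)^2} \right|. \qquad (6)
\]

The first term is equal to

\[
\sup_{f \in \mathcal{F}_n, \, \sigma_P(f) \ge \sigma_n} \left| \frac{(E_n - E_P)(f(Z_i) - E_P f(Z_i))^2}{\mu_{2,P}([f - E_P f(Z_i)]^2) \vee \sigma_n} \right| \cdot \frac{\mu_{2,P}([f - E_P f(Z_i)]^2) \vee \sigma_n}{\sigma_P(f)^2}.
\]

We have $\mu_{2,P}([f - E_P f(Z_i)]^2)^2 = E_P[f(Z_i) - E_P f(Z_i)]^4 \le 4\bar f^2 E_P[f(Z_i) - E_P f(Z_i)]^2 = 4\bar f^2 \sigma_P(f)^2$, so that

\[
\frac{\mu_{2,P}([f - E_P f(Z_i)]^2) \vee \sigma_n}{\sigma_P(f)^2}
\le \frac{[2\bar f \sigma_P(f)] \vee \sigma_n}{\sigma_P(f)^2}
\le \frac{2\bar f \vee 1}{\sigma_P(f)}
\le \frac{2\bar f \vee 1}{\sigma_n}
\]

where the last two inequalities hold for $\sigma_P(f) \ge \sigma_n$. Thus, for any $\varepsilon > 0$,

\[
\sup_{P \in \mathcal{P}} P\left( \sup_{f \in \mathcal{F}_n, \, \sigma_P(f) \ge \sigma_n} \left| \frac{(E_n - E_P)(f(Z_i) - E_P f(Z_i))^2}{\sigma_P(f)^2} \right| > \varepsilon \text{ some } n \ge N \right)
\]
\[
\le \sup_{P \in \mathcal{P}} P\left( \sup_{f \in \mathcal{F}_n, \, \sigma_P(f) \ge \sigma_n} \frac{2\bar f \vee 1}{\sigma_n} \left| \frac{(E_n - E_P)(f(Z_i) - E_P f(Z_i))^2}{\{E_P[(f(Z_i) - E_P f(Z_i))^2]^2\}^{1/2} \vee \sigma_n} \right| > \varepsilon \text{ some } n \ge N \right)
\]
\[
\le \sup_{P \in \mathcal{P}} P\left( \sup_{f \in \mathcal{F}_n, \, \sigma_P(f) \ge \sigma_n} \frac{\sqrt{n}}{\sqrt{\log n}} \left| \frac{(E_n - E_P)(f(Z_i) - E_P f(Z_i))^2}{\{E_P[(f(Z_i) - E_P f(Z_i))^2]^2\}^{1/2} \vee \sigma_n} \right| > \frac{c\varepsilon}{2\bar f \vee 1} \text{ some } n \ge N \right)
\]

where the last inequality holds for $\sigma_n\sqrt{n/\log n} \ge c$. By Lemma A.1, this will go to zero if c
is large enough so that $c\varepsilon/(2\bar f \vee 1)$ is greater than the B for which the conclusion of Lemma
A.1 holds for the class $\mathcal{G}^2_{n,P}$.

The probability that the second term in the last line of Equation (6) is greater than
$\varepsilon > 0$ for some $n \ge N$ goes to zero uniformly in $P \in \mathcal{P}$ by Lemma A.1 with the class
$\{f - E_P f(Z_i) \mid f \in \mathcal{F}_n\}$ taking the place of $\mathcal{F}_{n,P}$ in that lemma. □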

Combining these lemmas gives a consistency result for classes of functions weighted by
their standard deviations. The conditions are the same as those for Lemma A.2.

Lemma A.3. Let $Z_1, \ldots, Z_n$ be iid observations and let $\mathcal{F}_n$ be a sequence of classes of
functions and $\mathcal{P}$ a set of probability distributions such that, for some $\bar f$, $f(Z_i) \le \bar f$ with
P-probability one for $P \in \mathcal{P}$ and $f \in \mathcal{F}_n$ for each n. Let $\sigma_P(f) = (E_P f(Z_i)^2 - (E_P f(Z_i))^2)^{1/2}$.
Define $\mathcal{G}^1_{n,P} = \{(f - E_P f(Z_i))\sigma_n/(\sigma_P(f) \vee \sigma_n)\}$ and $\mathcal{G}^2_{n,P} = \{(f - E_P f(Z_i))^2 \sigma_n/(\mu_{2,P}([f -
E_P f(Z_i)]^2) \vee \sigma_n)\}$, and suppose that, for some positive constants A and W,

\[
\sup_{P \in \mathcal{P}} \sup_{n \in \mathbb{N}} \sup_{Q} N(\varepsilon, \mathcal{G}^i_{n,P}, L_1(Q)) \le A \varepsilon^{-W}
\]

for $0 < \varepsilon < 1$ and i = 1, 2, where the supremum over Q is over all probability measures.
Then, for some B and c that do not depend on N or P, if $\sigma_n\sqrt{n/\log n} \ge c$ for all n,

\[
\sup_{P \in \mathcal{P}} P\left( \sup_{f \in \mathcal{F}_n} \frac{\sqrt{n}}{\sqrt{\log n}} \left| \frac{(E_n - E_P) f(Z_i)}{\hat\sigma(f) \vee \sigma_n} \right| > B \text{ some } n \ge N \right) \stackrel{N \to \infty}{\to} 0.
\]



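Proof. We have

\[
P\left( \sup_{f \in \mathcal{F}_n} \frac{\sqrt{n}}{\sqrt{\log n}} \left| \frac{(E_n - E_P) f(Z_i)}{\hat\sigma(f) \vee \sigma_n} \right| > B \text{ some } n \ge N \right)
\]
\[
= P\left( \sup_{f \in \mathcal{F}_n} \frac{\sqrt{n}}{\sqrt{\log n}} \left| \frac{(E_n - E_P) f(Z_i)}{\sigma_P(f) \vee \sigma_n} \right| \cdot \frac{\sigma_P(f) \vee \sigma_n}{\hat\sigma(f) \vee \sigma_n} > B \text{ some } n \ge N \right)
\]
\[
\le P\left( \sup_{f \in \mathcal{F}_n} \frac{\sqrt{n}}{\sqrt{\log n}} \left| \frac{(E_n - E_P) f(Z_i)}{\sigma_P(f) \vee \sigma_n} \right| > B/2 \text{ some } n \ge N \right)
\]
\[
+ P\left( \inf_{f \in \mathcal{F}_n} \frac{\hat\sigma(f) \vee \sigma_n}{\sigma_P(f) \vee \sigma_n} < 1/2 \text{ some } n \ge N \right).
\]

The second to last line goes to zero uniformly in $P \in \mathcal{P}$ by Lemma A.1 applied to the classes
$\{f - E_P(f) \mid f \in \mathcal{F}_n, P \in \mathcal{P}\}$ (here, B must be chosen large enough so that the conclusion of
this lemma holds with B replaced by B/2). Since $\frac{\hat\sigma(f) \vee \sigma_n}{\sigma_P(f) \vee \sigma_n} \ge 1 > 1/2$ when $\sigma_P(f) \le \sigma_n$, the
last line is bounded by

\[
P\left( \inf_{f \in \mathcal{F}_n, \, \sigma_P(f) \ge \sigma_n} \frac{\hat\sigma(f)}{\sigma_P(f)} < 1/2 \text{ some } n \ge N \right),
\]

which goes to zero uniformly in $P \in \mathcal{P}$ if $\sigma_n\sqrt{n/\log n} \ge c$ for c large enough by Lemma A.2.

□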
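
The quantity controlled by Lemma A.3 can be simulated directly for a simple class. The following is an illustrative sketch only, assuming $X_i$ uniform on (0, 1), the class of interval indicators $I(s < X_i < t)$ on a hypothetical grid of endpoints, and the truncation point $\sigma_n = \tfrac{1}{2}\sqrt{(\log n)(\log\log n)/n}$; none of the names below come from the paper.

```python
import math
import random


def weighted_sup(n: int, sigma_n: float, grid: int = 20, seed: int = 0) -> float:
    """sqrt(n / log n) * max over intervals of |(E_n - E_P) f| / max(sd_n(f), sigma_n)."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]            # X_i ~ Uniform(0, 1)
    scale = math.sqrt(n / math.log(n))
    worst = 0.0
    for i in range(grid):
        for j in range(i + 1, grid + 1):
            s, t = i / grid, j / grid
            mean_n = sum(1 for x in xs if s < x < t) / n        # E_n f
            mean_p = t - s                                      # E_P f under uniformity
            sd_n = math.sqrt(max(mean_n * (1.0 - mean_n), 0.0)) # sample sd of an indicator
            worst = max(worst, scale * abs(mean_n - mean_p) / max(sd_n, sigma_n))
    return worst


for n in (200, 500, 1000):
    sigma_n = 0.5 * math.sqrt(math.log(n) * math.log(math.log(n)) / n)
    print(n, round(weighted_sup(n, sigma_n), 3))
```

Raising the truncation point can only increase each denominator, so for a fixed sample the statistic is nonincreasing in $\sigma_n$; the truncation is what keeps very short intervals with tiny estimated variance from dominating the supremum.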



A.2 Conditions for Exact Rate of Convergence

If $\sigma_n$ is fixed, we will have a $\sqrt{n}$ rate of uniform convergence for the KS statistic. The
$\sqrt{n/\log n}$ rate of convergence results used in Theorem 3.1 do not rule this out for the case
where $\sigma_n$ goes to zero, but another argument shows that the rate of convergence will be
strictly slower than $\sqrt{n}$ in many situations.

Assumption A.1. For some $\theta \in \Theta_0(P)$, some j, and some open set $\mathcal{X}$, the following hold.
(i) $E_P(m_j(W_i, \theta) \mid X_i) = 0$ a.s. on $\mathcal{X}$ and $X_i$ has a density $f_X(x)$ on $\mathcal{X}$ that is bounded
from above and from below away from zero. (ii) $\mathrm{var}(m(W_i, \theta) \mid X_i = x)$ is continuous as a
function of x and bounded away from zero and infinity on $\mathcal{X}$. (iii) $\mathcal{G}$ contains the function
$t \mapsto k((t - x)/h)$ for all x and all h less than some fixed positive constant where k satisfies
Assumption 6.2 and is continuous at zero.

The assumption on the set of functions $\mathcal{G}$ covers many commonly used cases, including
indicator sets for $d_X$ dimensional rectangles or boxes.

Theorem A.2. If Assumption A.1 holds and S satisfies Assumption 3.4 then, if $\sigma_n \to 0$,
$\sqrt{n} T_n(\theta)$ will diverge to $\infty$.

Proof. Fix any points $x_1, \ldots, x_\ell \in \mathcal{X}$. For k from 1 to $\ell$, let $g_{n,k}(t) = k((t - x_k)/h_n)$ and

\[
Z_{n,k} = \frac{1}{\hat\sigma_{n,j}(\theta, g_{n,k}) \vee \sigma_n} E_n m_j(W_i, \theta) k((X_i - x_k)/h_n)
\]

where $h_n$ is a sequence going to zero such that $h_n^{d_X/2}/\sigma_n$ goes to infinity and $h_n^{d_X} \ge n^{-a}$ for
some a < 1. By the assumption on S, $\sqrt{n} T_n(\theta)$ will diverge to $\infty$ if
$\sqrt{n} \inf_{x,h} \frac{1}{\hat\sigma_{n,j}(\theta, g_{x,h}) \vee \sigma_n} E_n m_j(W_i, \theta) k((X_i - x)/h)$, where $g_{x,h}(t) = k((t - x)/h)$,
diverges to $-\infty$, and, for this, it is sufficient to show that $\min_k \sqrt{n} Z_{n,k}$ can be made
arbitrarily small asymptotically by making $\ell$ large enough. Using standard arguments,
it can be shown that $\hat\sigma_{n,j}(\theta, g_{n,k})/\sigma_{P,j}(\theta, g_{n,k})$ converges in probability to one, and, since
$\sigma_{P,j}(\theta, g_{n,k})/h_n^{d_X/2}$ converges to a constant under these assumptions, we also have that

\[
Z_{n,k} = \frac{1}{\hat\sigma_{n,j}(\theta, g_{n,k})} E_n m_j(W_i, \theta) k((X_i - x_k)/h_n)
\]

with probability approaching one. By the Lindeberg central limit theorem, defining

\[
\tilde Z_{n,k} = \frac{1}{\sigma_{P,j}(\theta, g_{n,k})} E_n m_j(W_i, \theta) k((X_i - x_k)/h_n),
\]

$(\sqrt{n}\tilde Z_{n,1}, \ldots, \sqrt{n}\tilde Z_{n,\ell})$ converges to a vector of independent standard normal variables, so,
since each $Z_{n,k}$ is eventually equal to $\tilde Z_{n,k}$ times something that converges to one, $(\sqrt{n} Z_{n,1}, \ldots, \sqrt{n} Z_{n,\ell})$
also converges to a vector of independent standard normal variables. Thus, $\min_k \sqrt{n} Z_{n,k}$
converges to the minimum of $\ell$ independent standard normal variables, which can be made
arbitrarily small by making $\ell$ large.

□
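
The last step of the proof relies on the fact that the minimum of many independent standard normal variables drifts toward minus infinity as their number grows. The short simulation below illustrates this; it is a sketch using only the standard library, and the function name, seed and replication count are arbitrary choices made for the example.

```python
import random
import statistics


def mean_min_of_normals(num_vars: int, reps: int = 2000, seed: int = 0) -> float:
    """Monte Carlo average of the minimum of `num_vars` independent N(0, 1) draws."""
    rng = random.Random(seed)
    minima = [
        min(rng.gauss(0.0, 1.0) for _ in range(num_vars))
        for _ in range(reps)
    ]
    return statistics.fmean(minima)


for num_vars in (1, 10, 100, 1000):
    print(num_vars, round(mean_min_of_normals(num_vars), 2))
```

The averages decrease steadily, though slowly, as the number of variables increases, which is why the statistic scaled by $\sqrt{n}$ cannot remain bounded once the class contains an increasing number of nearly independent localized functions.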



A.3 Rates of Convergence for Slope Parameters

In this section of the appendix, I present a counterexample that shows that a condition
along the lines of part (iii) of Assumption 5.4 is necessary to obtain the rate of convergence
in Theorem 5.2. As discussed below, a similar counterexample shows that a condition on the
parameter space $\Theta$ such as Assumption 5.3 is necessary in Theorem 5.1. These counterexamples
also show that the first display in Assumption 4.4 cannot be replaced with an assumption
that only takes into account the magnitude of the derivative vector. Consider an example
where $E_P(W_i^H \mid X_i = x) = x^2$, $E_P(W_i^L \mid X_i = x) = -x^2$, $\mathrm{var}(W_i^H \mid X_i) = \mathrm{var}(W_i^L \mid X_i) = 1$, and
$X_i$ has a uniform distribution on [−1/2, 1/2]. Suppose that we use the set of functions
$\{I(s < X_i < s + t) \mid s \in \mathbb{R}, t \ge 0\}$.

In this case, the identified set is a single point (0, 0). Consider the sequence of local
alternatives given by $\theta_n = (0, b_n)$. We have, for all s, t with $-1/2 \le s \le s + t \le 1/2$,

\[
E_P[(W_i^H - b_n X_i) I(s < X_i < s + t)] = E_P[(X_i^2 - b_n X_i) I(s < X_i < s + t)]
\]
\[
= \int_s^{s+t} (x^2 - b_n x)\, dx = \int_s^{s+t} [(x - b_n/2)^2 - b_n^2/4]\, dx
\]
\[
\ge \int_{-t/2}^{t/2} [u^2 - b_n^2/4]\, du = 2\left[ \frac{1}{3}u^3 - \frac{b_n^2}{4}u \right]_{u=0}^{t/2} = 2t\left[ \frac{t^2}{24} - \frac{b_n^2}{8} \right]
\]

and

\[
\mathrm{var}_P[(W_i^H - b_n X_i) I(s < X_i < s + t)] \ge E_P\{\mathrm{var}_P[(W_i^H - b_n X_i) I(s < X_i < s + t) \mid X_i]\}
= E_P[I(s < X_i < s + t)] = t.
\]

Thus, for s, t such that $E_P[(W_i^H - b_n X_i) I(s < X_i < s + t)]$ is negative,

\[
\frac{E_P[(W_i^H - b_n X_i) I(s < X_i < s + t)]}{\{\mathrm{var}_P[(W_i^H - b_n X_i) I(s < X_i < s + t)]\}^{1/2}}
\ge 2 t^{1/2}\left[ \frac{t^2}{24} - \frac{b_n^2}{8} \right]
\ge -\frac{3^{1/4}}{4} b_n^{5/2}.
\]

A symmetric argument applies to moments based on $W_i^L$. For some constant K, this sequence
of local alternatives will be in $C_n(c_n)$ if $b_n^{5/2} \le K((\log n)/n)^{1/2}$ iff. $b_n \le K^{2/5}((\log n)/n)^{1/5}$. In
contrast, convergence to the identified set for one sided regression will be at a $((\log n)/n)^{2/5}$
rate if the parameter space $\Theta$ is restricted so that the absolute value of the slope parameter
cannot be too large.
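
The inequalities in this counterexample can be verified numerically on a grid. The sketch below assumes the setup described above ($X_i$ uniform on [−1/2, 1/2], conditional mean $x^2$ for the upper outcome, local alternative with slope b) and checks two things: the integral of $x^2 - bx$ over $[s, s+t]$ is at least $2t[t^2/24 - b^2/8]$, and the integral divided by $t^{1/2}$ is at least $-(3^{1/4}/4)b^{5/2}$. The second constant is a crude bound derived here for illustration, and the function names are hypothetical.

```python
import math


def moment(s: float, t: float, b: float) -> float:
    """Integral of x^2 - b*x over [s, s + t] (the density of X_i is 1 on [-1/2, 1/2])."""
    lo, hi = s, s + t
    return (hi ** 3 - lo ** 3) / 3.0 - b * (hi ** 2 - lo ** 2) / 2.0


def lower_bound(t: float, b: float) -> float:
    """2 t [t^2 / 24 - b^2 / 8], attained when the interval is centered at b / 2."""
    return 2.0 * t * (t ** 2 / 24.0 - b ** 2 / 8.0)


def check_bounds(b: float, grid: int = 100, tol: float = 1e-12) -> bool:
    """Check both inequalities for all grid intervals inside [-1/2, 1/2]."""
    floor = -(3.0 ** 0.25 / 4.0) * b ** 2.5
    for i in range(grid):
        for j in range(i + 1, grid + 1):
            s, t = -0.5 + i / grid, (j - i) / grid
            value = moment(s, t, b)
            if value < lower_bound(t, b) - tol:
                return False
            if value / math.sqrt(t) < floor - tol:
                return False
    return True


print([check_bounds(b) for b in (0.05, 0.1, 0.2, 0.4)])
```

Because the normalized moment is never more negative than a multiple of $b^{5/2}$, a slope deviation b is only detected once $b^{5/2}$ exceeds the order $((\log n)/n)^{1/2}$, which gives the $((\log n)/n)^{1/5}$ rate discussed above.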

Now consider the one sided regression model of Section 5.1 with $E_P(W_i^H \mid X_i = x) = x^2$
and the parameter space $\Theta$ given by $[0, \infty) \times \mathbb{R}$. That is, the parameter space $\Theta$ incorporates
the prior knowledge that the intercept is nonnegative. Again, the identified set is the point
(0, 0), and the Hausdorff distance between the set estimate $C_n(c_n)$ and the identified set will
be at least $b_n$ if $C_n(c_n)$ contains the point $(0, b_n)$. By the same argument used above, $(0, b_n)$
will be in $C_n(c_n)$ for some sequence $b_n$ going to zero at a $((\log n)/n)^{1/5}$ rate, so that the rate
of convergence of $C_n(c_n)$ to the identified set will be no faster than $((\log n)/n)^{1/5}$, which is
slower than the $((\log n)/n)^{2/5}$ rate given by Theorem 5.1 when the intercept is not restricted.
Note that, in the case where the intercept parameter is not restricted a priori, the sequence
of local alternatives $(0, b_n)$ will still be in the estimate $C_n(c_n)$, but the distance of these points
to the identified set will no longer be equal to $b_n$, since the identified set will contain a point
$(\theta'(b_n), b_n)$ for some $\theta'(b_n)$ that is smaller in magnitude than $b_n$.



A.4 Covering Number Conditions

In this section, I state some simple sufficient conditions under which Assumption 3.2 holds.
I first prove that Assumption 3.2 holds under individual bounds on the complexity of the
classes $\mathcal{G}$ and $\{w \mapsto m(w, \theta) \mid \theta \in \Theta\}$. The proof of this result uses Lemma A.4, stated and
proved at the end of the section. I then provide examples of classes $\mathcal{G}$ that satisfy these
bounds, and show that the class $\{w \mapsto m(w, \theta) \mid \theta \in \Theta\}$ satisfies these bounds in each of the
applications covered in Section 5. Throughout this section, I define $\mathcal{F}_m = \{w \mapsto m(w, \theta) \mid \theta \in
\Theta\}$ to be the class of moment functions indexed by $\theta$.

The following theorem translates bounds on the covering numbers of the classes $\mathcal{G}$ and
$\{w \mapsto m(w, \theta) \mid \theta \in \Theta\}$ to the conditions of Assumption 3.2.

Theorem A.3. Suppose that the classes $\mathcal{F}_m = \{w \mapsto m(w, \theta) \mid \theta \in \Theta\}$ and $\mathcal{G}$ are uniformly
bounded and satisfy $\sup_Q N(\varepsilon, \mathcal{F}_m, L_1(Q)) \le A\varepsilon^{-W}$ and $\sup_Q N(\varepsilon, \mathcal{G}, L_1(Q)) \le A\varepsilon^{-W}$ for
some A, W > 0 where the supremum is over all probability measures Q. Then Assumption
3.2 holds.

Proof. The result follows immediately from Lemma A.4, since the classes of functions in
Assumption 3.2 are sums and products of these bounded classes and bounded classes of
constant functions, which also have polynomial uniform covering numbers. □

With this result in hand, we can verify Assumption 3.2 for a particular model and choice
of $\mathcal{G}$ using results stated in Pollard (1984), van der Vaart and Wellner (1996) and other
sources. For convenience, I do this here for some choices of $\mathcal{G}$.

Theorem A.4. Suppose that $\mathcal{F}_m = \{w \mapsto m(w, \theta) \mid \theta \in \Theta\}$ and $\mathcal{G}$ are uniformly bounded and
$\sup_Q N(\varepsilon, \mathcal{F}_m, L_1(Q)) \le A\varepsilon^{-W}$. Then Assumption 3.2 will hold for the following classes of
functions $\mathcal{G}$:

(i) The class of indicator functions $\mathcal{G} = \{x \mapsto I(x \in V) \mid V \in \mathcal{V}\}$ for any VC class of sets
$\mathcal{V}$.

(ii) The class of dilations of a kernel function k given by $\mathcal{G} = \{x \mapsto k((x - t)/h) \mid t \in
\mathbb{R}^{d_X}, h \in \mathbb{R}_+\}$ for any kernel function k given by $k(x) = r(\|x\|)$ for a decreasing,
bounded function r on $\mathbb{R}_+$.

Proof. The covering number bound for $\mathcal{G}$ in Theorem A.3 holds by Lemma 25 in Pollard
(1984) (since a VC class of sets has polynomial discrimination) for part (i), and by problem
18 in Chapter 2 of Pollard (1984) for part (ii). □

See Pollard (1984) for the definition of a VC class and examples of VC classes of sets.
The class of all $d_X$ dimensional rectangles falls into this category. The condition that
the class of functions $\mathcal{F}_m = \{w \mapsto m(w, \theta) \mid \theta \in \Theta\}$ satisfy the covering number bound
$\sup_Q N(\varepsilon, \mathcal{F}_m, L_1(Q)) \le A\varepsilon^{-W}$ can be verified on a case by case basis using general results
such as those in Pollard (1984) and van der Vaart and Wellner (1996). I do this for the
examples in this paper in the next theorem.

Theorem A.5. The class of moment functions $\mathcal{F}_m = \{w \mapsto m(w, \theta) \mid \theta \in \Theta\}$ satisfies the
covering number bound $\sup_Q N(\varepsilon, \mathcal{F}_m, L_1(Q)) \le A\varepsilon^{-W}$ in all of the models of Section 5 as
long as the data are bounded and $\Theta$ is compact in the conditional mean models of Sections

Proof. The class $\{w \mapsto m(w, \theta) \mid \theta \in \Theta\}$ has VC subgraph for all of the models of Section 5,
so the result follows from Lemma 25 in Pollard (1984). □

The proof of Theorem A.3 uses the following lemma, which modifies an argument from
van der Vaart and Wellner (1996).


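Lemma A.4. Let $\mathcal{F}$, $\mathcal{G}$ and $\mathcal{H}$ be classes of functions bounded by a fixed constant B, and
let $\mathcal{F} \cdot \mathcal{G} + \mathcal{H} = \{f \cdot g + h \mid f \in \mathcal{F}, g \in \mathcal{G}, h \in \mathcal{H}\}$. Suppose that, for some A, W > 0,
$\sup_Q N(\varepsilon, \mathcal{F}, L_1(Q)) \le A\varepsilon^{-W}$, where the supremum is taken over all probability measures,
and that the same statement holds with $\mathcal{F}$ replaced by $\mathcal{G}$ and $\mathcal{H}$. Then
$\sup_Q N(\varepsilon, \mathcal{F} \cdot \mathcal{G} + \mathcal{H}, L_1(Q)) \le A^3 (2B + 1)^{3W} \varepsilon^{-3W}$, where the supremum is again taken over all probability
measures.

Proof. The result follows from an argument similar to the proof of Theorem 2.10.20 in
van der Vaart and Wellner (1996). Given $\varepsilon > 0$ and a probability measure Q, let $k_{\mathcal{F},Q} =
N(\varepsilon, \mathcal{F}, L_1(Q)) \le \sup_{Q'} N(\varepsilon, \mathcal{F}, L_1(Q'))$ and let $f_{1,Q}, \ldots, f_{k_{\mathcal{F},Q},Q}$ be such that, for all $f \in \mathcal{F}$
there exists a $f_{i,Q}$ such that $E_Q|f_{i,Q}(Z_i) - f(Z_i)| \le \varepsilon$ (here, the notation $E_Q f(Z_i)$ refers to
the expectation $\int f(z)\, dQ(z)$ of $f(Z_i)$ for a random variable with distribution Q). Define
$k_{\mathcal{G},Q}$, $k_{\mathcal{H},Q}$, $g_{1,Q}, \ldots, g_{k_{\mathcal{G},Q},Q}$ and $h_{1,Q}, \ldots, h_{k_{\mathcal{H},Q},Q}$ similarly. For any $fg + h \in \mathcal{F} \cdot \mathcal{G} + \mathcal{H}$, there
is some $j_{\mathcal{F}}$, $j_{\mathcal{G}}$ and $j_{\mathcal{H}}$ such that $E_Q|f_{j_{\mathcal{F}},Q}(Z_i) - f(Z_i)| \le \varepsilon$, $E_Q|g_{j_{\mathcal{G}},Q}(Z_i) - g(Z_i)| \le \varepsilon$ and
$E_Q|h_{j_{\mathcal{H}},Q}(Z_i) - h(Z_i)| \le \varepsilon$. We have, for all z,

\[
|f(z)g(z) + h(z) - (f_{j_{\mathcal{F}},Q}(z) g_{j_{\mathcal{G}},Q}(z) + h_{j_{\mathcal{H}},Q}(z))|
\]
\[
= |(f(z) - f_{j_{\mathcal{F}},Q}(z)) g(z) + (g(z) - g_{j_{\mathcal{G}},Q}(z)) f_{j_{\mathcal{F}},Q}(z) + h(z) - h_{j_{\mathcal{H}},Q}(z)|
\]
\[
\le |f(z) - f_{j_{\mathcal{F}},Q}(z)| \cdot |g(z)| + |g(z) - g_{j_{\mathcal{G}},Q}(z)| \cdot |f_{j_{\mathcal{F}},Q}(z)| + |h(z) - h_{j_{\mathcal{H}},Q}(z)|
\]
\[
\le |f(z) - f_{j_{\mathcal{F}},Q}(z)| \cdot B + |g(z) - g_{j_{\mathcal{G}},Q}(z)| \cdot B + |h(z) - h_{j_{\mathcal{H}},Q}(z)|
\]

so that

\[
E_Q|f(Z_i)g(Z_i) + h(Z_i) - (f_{j_{\mathcal{F}},Q}(Z_i) g_{j_{\mathcal{G}},Q}(Z_i) + h_{j_{\mathcal{H}},Q}(Z_i))|
\]
\[
\le (E_Q|f(Z_i) - f_{j_{\mathcal{F}},Q}(Z_i)| + E_Q|g(Z_i) - g_{j_{\mathcal{G}},Q}(Z_i)|) B + E_Q|h(Z_i) - h_{j_{\mathcal{H}},Q}(Z_i)| \le (2B + 1)\varepsilon.
\]

Since Q was arbitrary, it follows that $\sup_Q N((2B+1)\varepsilon, \mathcal{F} \cdot \mathcal{G} + \mathcal{H}, L_1(Q)) \le (\sup_Q N(\varepsilon, \mathcal{F}, L_1(Q))) \cdot
(\sup_Q N(\varepsilon, \mathcal{G}, L_1(Q))) \cdot (\sup_Q N(\varepsilon, \mathcal{H}, L_1(Q))) \le A^3 \varepsilon^{-3W}$. Replacing $\varepsilon$ with $\varepsilon/(2B + 1)$ gives
the result. □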
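
The pointwise inequality at the heart of this proof, and the change of variables at the end, can both be checked numerically. The sketch below is an illustration with hypothetical function names: it draws values in [−B, B] standing in for f(z), g(z), h(z) and their approximating functions, and compares the error of the product-plus-sum with the bound used in the proof.

```python
import random


def approximation_gap(f, g, h, f0, g0, h0, bound):
    """Return (error of f*g + h vs f0*g0 + h0, bound B|f-f0| + B|g-g0| + |h-h0|)."""
    lhs = abs(f * g + h - (f0 * g0 + h0))
    rhs = bound * abs(f - f0) + bound * abs(g - g0) + abs(h - h0)
    return lhs, rhs


def product_class_bound(a: float, w: float, bound: float, eps: float) -> float:
    """A^3 (2B + 1)^(3W) eps^(-3W), the covering number bound for F*G + H."""
    return a ** 3 * (2.0 * bound + 1.0) ** (3.0 * w) * eps ** (-3.0 * w)


rng = random.Random(0)
B = 2.0
worst_slack = float("inf")
for _ in range(10000):
    values = [rng.uniform(-B, B) for _ in range(6)]   # all values lie in [-B, B]
    lhs, rhs = approximation_gap(*values, bound=B)
    worst_slack = min(worst_slack, rhs - lhs)
print(worst_slack >= -1e-12)
```

Evaluating `product_class_bound` at $(2B+1)\varepsilon$ returns $A^3\varepsilon^{-3W}$, which is the rescaling step that converts a cover at scale $(2B+1)\varepsilon$ into the stated bound at scale $\varepsilon$.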



A.5 Proofs

This section of the appendix contains proofs of the results stated in the body of the paper. 

proof of Theorem \3.1[ If Oo(-P) ^ C„(c„), then, for some 6o G 9o(-P), \/n/ log nTn{6o) > c„ 
so that for some g E Q, 



{6,g)\J an) y/n 

so that, for some j, -^^fg^ < -^^^i^5,i- Since 9, E Qo{P), Epm{W,,e,)g{Xi) > 0, 
so this implies that 

n {En-Ep)m{W,,eo)g{X,) „ 

< -C„A«i. 



^ ^njif^^ 9)^ '^ri 
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Thus, 60 (-P) ^ Cn{cn) imphes that the above display holds for some 6^0, g, and j. UK is 
large enough so that the conclusion of Lemma [A.3I holds for B = K ■ Ks^i- and c from that 
lemma equal to K, the probability that there exist some 9^ G 0o(-P) and g & Q such that 
this event holds and c„ and is greater than K will be bounded by a sequence that goes 
to zero uniformly m. P eV. 

□ 



proof of Theorem \4 ■ 1\ If dH{Qo{P)iCn{cn)) > e and 0o(-P) ^ Cn(c„), then there exists 



some 9 G Cn{cn) such that (i//(6', 9o(-P)) > e. Letting 6 be such that, for all P E V, 
EprrijiWi, 9)gj{Xi) < —6 for some j and g E Q, this implies that, once anj{9', g) is bounded 
uniformly in {9',g) by some a (this happens with probability approaching one uniformly in 
P G P by Lemma O]), 

-Tni9) < {Er,m,iW,,9)g,iX,) V 0) < - sup |(E„ - Ep)mfe(W^„ • 

The probability that supg/g j^ \ {En — Ep)mk(Wi,9')gk{Xi)\ < 6/2 goes to one uniformly in 
P G P by Lemma [A. H and once this holds, the above display will imply Tn{9) > 6/ {2Ks^2^). 
This cannot hold for 9 G C„(c„) for c„a/ (log n)/n < 6/{2Ks,20'), and the probability of this 
holding goes to zero uniformly in P G "P. 

□ 

proof of Theorem^ If dj,(eo(P), C„(cO) > B (^)^^', Q,{P) C C„(c„) and rfH(C„(c„, eo(P)) < 
5 (the latter two events hold with probability approaching one uniformly in P G P by Theo- 

rems l3.1l and l4.ip . then there exists some 9 G Cn{cn) such that (iiy(6', 6o(P)) > P ' 



For this ^ (and P), there will be, by Assumption \A.2\ a g* E Q and j* such that 
f^p,r{^,9*) < -(C/2)pi/^ c2 logn^ 



c2 log w 



(replacing C with C /2 takes care of the possibility that the infimum in the assumption is 
not achieved) and, by part (ii), for some constant rj > that does not depend on P, this will 
eventually imply 

f^p,j<0,9*) < _(C/2)pi/^ ^ c^logn^ 



(Tpj*{9,g*)\/ {r]an) \ n 
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so that, letting Ci = {C/2){'q A 1), we will have ^^^pl^^gJ)*!,^ < -CiB^'^ (^^^^^ ^ gi^^^^ 
e e C„(c„), we will also have T„(e) < c„ so that, for all ^ G ^ and all j, - > 



-Ks,2Cn H-TY'^. By Lemma D this will also imply ^^^g^ > (i^)'^' with 

probability approaching one uniformly in P G P. When these events all hold, we will have 

KA^^9*) ^^Pri^^9*) ^ Ks,2Cn f \ogn\ ^'^ ^ ^^^1/7 ( ^'^ 

apj*{6,g*)y an apj*{6, g*) \/ an ~ 2 \ n J ^ \ n J 



so that 



in 

sup 



_ — 

eee,gGe,je{i,...,j} vlog 



n 



apj {6, g) V an apj {9,g)y an 



>Cn{B'/^Ci-Ks,2/2). 



Since $c_n$ is bounded away from zero, we can choose $B$ large so that $c_n(B^{1/\gamma}C_1 - K_{S,2}/2)$ is
large enough so that the conclusion of Lemma A.1 holds with $B$ from that lemma replaced
by $c_n(B^{1/\gamma}C_1 - K_{S,2}/2)$. For this value of $B$, the probability of the last display holding will
go to zero uniformly in $P \in \mathcal{P}$ so that the desired conclusion will hold. □

proof of Theorem 4.3. It is sufficient to find a $C$ such that, given $\theta$ and $P$, there exists a
$\theta_0(\theta, P)$, $j_0(\theta, P)$, and a $g \in \mathcal{G}$ such that

(^P,j{0,9) V d[e,eQ[P))^n 

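Given $\theta$ and $P$, let $\theta_0(\theta, P)$ and $j_0(\theta, P)$ be chosen as in Assumption 4.4. To avoid cumbersome
notation, I will use $\theta_0$ and $j_0$ to denote $\theta_0(\theta, P)$ and $j_0(\theta, P)$ when the dependence on
$\theta$ and $P$ is clear. For this $\theta_0$ and $j_0$, we will have, for $\|x - x_0\| \le \eta$,

\[
\begin{aligned}
\bar m_{j_0}(\theta, x, P) &= \bar m_{j_0}(\theta, x, P) - \bar m_{j_0}(\theta_0, x_0, P) \\
&= [\bar m_{j_0}(\theta, x, P) - \bar m_{j_0}(\theta_0, x, P)] + [\bar m_{j_0}(\theta_0, x, P) - \bar m_{j_0}(\theta_0, x_0, P)] \\
&\le \bar m_{\theta, j_0}(\theta^*, x, P)(\theta - \theta_0) + C\|x - x_0\|^{\alpha}
\end{aligned}
\]

for some $\theta^*$ between $\theta$ and $\theta_0$. By Assumptions 4.3 and 4.4, for $\|\theta - \theta_0\|$ and $\|x - x_0\|$ smaller
than some constant that does not depend on $P$ or $\theta$, this will be less than or equal to

\[
-(\eta/2)\|\theta - \theta_0\| + C\|x - x_0\|^{\alpha}.
\]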
For $\|x - x_0\| \le [\eta/(4C)]^{1/\alpha}\|\theta - \theta_0\|^{1/\alpha}$, this is less than or equal to $-(\eta/4)\|\theta - \theta_0\|$. Thus,
letting $g \in \mathcal{G}$ be as in Assumption 4.6 with $s = x_0$ and $t = [\eta/(4C)]^{1/\alpha}\|\theta - \theta_0\|^{1/\alpha}$ so that
$g(x) \le I(\|x - x_0\| \le [\eta/(4C)]^{1/\alpha}\|\theta - \theta_0\|^{1/\alpha})$ and $g(x) \ge C_{g,1} I(\|x - x_0\| \le [\eta/(4C)]^{1/\alpha}\|\theta -
\theta_0\|^{1/\alpha} C_{g,2})$, we will have



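\[
\mu_{P,j_0}(\theta, g) = E_P \bar m_{j_0}(\theta, X_i, P) g(X_i) \le -(\eta/4)\|\theta - \theta_0\| E_P g(X_i) \tag{7}
\]

and

\[
\sigma_{P,j_0}(\theta, g) = \{\mathrm{var}_P[m_{j_0}(W_i, \theta) g(X_i)]\}^{1/2} \le \{E_P[m_{j_0}(W_i, \theta) g(X_i)]^2\}^{1/2}
\le \bar Y \{E_P g(X_i)\}^{1/2}.
\]

The lower bound on $g$ implies that $\{E_P g(X_i)\}^{1/2}$ is greater than some constant that does
not depend on $P$ times $\|\theta - \theta_0\|^{d_X/(2\alpha)} \ge d(\theta, \Theta_0(P))^{d_X/(2\alpha)}$. Thus, for some constant $K$ that
does not depend on $P$, $\sigma_{P,j_0}(\theta, g) \vee d(\theta, \Theta_0(P))^{d_X/(2\alpha)} \le K\{E_P g(X_i)\}^{1/2}$. Thus,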

f^pjoi^^a) ^ zMll iifl n \\\F n(Y^^^/^ 

< ^^^\\9 - 9o\\C'JJP{\\x - xoW < [v/m]'^''\\0 - ^oir/"Cg,2}'^' 

< - do\\c'J,W^'{[v/{^c)]y"\\9 - eoir/"c,,2}'"/' 

where the second inequality follows from the lower bound on $g$. This is equal to a negative
constant that does not depend on $P$ times $\|\theta - \theta_0\|^{(d_X + 2\alpha)/(2\alpha)}$, so that Assumption 4.2 holds
with $\gamma = 2\alpha/(d_X + 2\alpha)$ and $\psi = d_X/(d_X + 2\alpha)$.

□ 
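As a check on the exponent bookkeeping in the last step, the values of $\gamma$ and $\psi$ can be recovered by matching powers of $\|\theta - \theta_0\|$. This is only a sketch: it assumes that Assumption 4.2 truncates the standard deviation at $d(\theta, \Theta_0(P))^{\psi/\gamma}$ and bounds the normalized mean by a multiple of $-d(\theta, \Theta_0(P))^{1/\gamma}$, which is how the assumption is used in the proof above.

```latex
\begin{align*}
\frac{\psi}{\gamma} &= \frac{d_X}{2\alpha}
  && \text{(power of } \|\theta - \theta_0\| \text{ in the lower bound on } \{E_P g(X_i)\}^{1/2}\text{)} \\
\frac{1}{\gamma} &= 1 + \frac{d_X}{2\alpha} = \frac{d_X + 2\alpha}{2\alpha}
  && \text{(power of } \|\theta - \theta_0\| \text{ in the final bound)} \\
\gamma &= \frac{2\alpha}{d_X + 2\alpha}, \qquad
\psi = \gamma \cdot \frac{d_X}{2\alpha} = \frac{d_X}{d_X + 2\alpha}.
\end{align*}
```

For instance, with $d_X = 1$ and $\alpha = 2$ this gives $\gamma = 4/5$ and $\psi = 1/5$.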

proof of Theorem 5.1. Assumption 4.3 holds because $\bar m(\theta, x)$ is linear, so it remains to verify
Assumption 4.4. Given $\theta \in \Theta$ and $P \in \mathcal{P}$, let $x_0(\theta, P)$ minimize $E_P(W_i^H|X_i = x) -
\theta_1 - x'\theta_{-1}$ over the support of $X_i$, and let $t(\theta, P)$ be the minimum (the minimum is taken
since $E(W_i^H|X_i = x) - \theta_1 - x'\theta_{-1}$ is continuous). Let $\theta_0(\theta, P) = (\theta_1 + t(\theta, P), \theta_{-1})$. Then
$\bar m(\theta_0(\theta, P), x, P) = E(W_i^H|X_i = x) - \theta_1 - t(\theta, P) - x'\theta_{-1}$ so that $\theta_0(\theta, P) \in \Theta_0(P)$ and
$\bar m(\theta_0(\theta, P), x_0(\theta, P), P) = 0$. We have

\[
\begin{aligned}
\bar m_\theta(\theta_0(\theta, P), x_0(\theta, P), P)(\theta - \theta_0(\theta, P)) &= -(1, x_0(\theta, P)')(-t(\theta, P), 0, \ldots, 0)' \\
&= t(\theta, P) = -\|(t(\theta, P), 0, \ldots, 0)'\| = -\|\theta - \theta_0(\theta, P)\|
\end{aligned}
\]

where the second to last equality holds because $t(\theta, P)$ is negative by definition of the identified
set. The Hölder continuity part of Assumption 4.4 is immediately implied by Assumption
5.1. Under Assumption 5.2, $x_0(\theta, P)$ must be on the interior of the support of $X_i$ by part
(ii) of this assumption. Thus, $x_0(\theta, P)$ is an interior minimum of the twice differentiable
function $x \mapsto E_P(W_i^H|X_i = x) - \theta_1 - x'\theta_{-1}$, so the first derivative of this function at $x_0(\theta, P)$
is zero. This and a second order mean value expansion of this function around $x_0(\theta, P)$ imply
the Hölder continuity part of Assumption 4.4 with $C$ a bound on the norm of the second
derivative matrix.

□ 
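The expansion step under Assumption 5.2 can be written out as follows. This is a sketch in which $f(x) = E_P(W_i^H|X_i = x) - \theta_1 - x'\theta_{-1}$, $x_0 = x_0(\theta, P)$, and $\tilde x$ (a point on the segment between $x$ and $x_0$) are shorthand introduced here for illustration and are not notation from the text.

```latex
\begin{align*}
f(x) - f(x_0)
  &= \nabla f(x_0)'(x - x_0) + \tfrac{1}{2}(x - x_0)'\,\nabla^2 f(\tilde x)\,(x - x_0) \\
  &= \tfrac{1}{2}(x - x_0)'\,\nabla^2 f(\tilde x)\,(x - x_0)
     && \text{since } \nabla f(x_0) = 0 \text{ at an interior minimum} \\
  &\le \tfrac{1}{2}\,\|\nabla^2 f(\tilde x)\|\,\|x - x_0\|^2
   \le C\,\|x - x_0\|^2 .
\end{align*}
```

Because $\bar m(\theta_0(\theta, P), x, P) = f(x) - t(\theta, P)$, the same bound applies to $\bar m(\theta_0, x, P) - \bar m(\theta_0, x_0, P)$, so the Hölder condition holds with exponent 2 whenever $C$ bounds the norm of the second derivative matrix.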

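proof of Theorem 5.2. Everything is the same as in the proof of Theorem 5.1 except for
the verification of the first part of Assumption 4.4. For any $\theta$, either $(\theta_1', \theta_2)$ is in $\Theta_0(P)$
for some $\theta_1'$, in which case the same argument to verify Assumption 4.4 goes through, or
$\theta_2 > \bar\theta_2$ or $\theta_2 < \underline\theta_2$, where $\bar\theta_2 = \sup\{\theta_2 | (\theta_1, \theta_2) \in \Theta_0(P) \text{ some } \theta_1\}$ and $\underline\theta_2 = \inf\{\theta_2 | (\theta_1, \theta_2) \in
\Theta_0(P) \text{ some } \theta_1\}$. Suppose that $\theta_2 > \bar\theta_2$ (the case where $\theta_2 < \underline\theta_2$ is symmetric). Then, for some
$\theta_1'$, we have $(\theta_1', \bar\theta_2) \in \Theta_0(P)$, and, for some $x_{0,2} < x_{0,1}$, $E(W_i^H|X_i = x_{0,1}) = \theta_1' + x_{0,1}\bar\theta_2$ and
$E(W_i^L|X_i = x_{0,2}) = \theta_1' + x_{0,2}\bar\theta_2$. We have $\bar m_{\theta,1}(\theta, x, P) = -(1, x)$ and $\bar m_{\theta,2}(\theta, x, P) = (1, x)$,
so that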



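\[
\bar m_{\theta,1}(\theta, x_{0,1}, P)(\theta - (\theta_1', \bar\theta_2)) = -(1, x_{0,1})(\theta - (\theta_1', \bar\theta_2))
\]

and

\[
\bar m_{\theta,2}(\theta, x_{0,2}, P)(\theta - (\theta_1', \bar\theta_2)) = (1, x_{0,2})(\theta - (\theta_1', \bar\theta_2)).
\]

If the sum of the expressions in these two displays is less than $-2\eta\|\theta - (\theta_1', \bar\theta_2)\|$, at least one
of them must be less than $-\eta\|\theta - (\theta_1', \bar\theta_2)\|$, so it suffices to bound

\[
\frac{[(1, x_{0,2}) - (1, x_{0,1})](\theta - (\theta_1', \bar\theta_2))}{\|\theta - (\theta_1', \bar\theta_2)\|}
= \frac{-(x_{0,1} - x_{0,2})(\theta_2 - \bar\theta_2)}{[(\theta_1 - \theta_1')^2 + (\theta_2 - \bar\theta_2)^2]^{1/2}}.
\]

For this, it suffices to bound $x_{0,1} - x_{0,2}$ away from zero and $|\theta_1 - \theta_1'|/|\theta_2 - \bar\theta_2|$ away from
infinity.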

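$x_{0,1} - x_{0,2}$ is bounded away from zero by parts (ii) and (iii) of Assumption 5.4. For
parameter values where $|\theta_1 - \theta_1'|/|\theta_2 - \bar\theta_2|$ is large, we can use another argument. Note that

\[
\frac{-(1, x_{0,1})(\theta - (\theta_1', \bar\theta_2))}{\|\theta - (\theta_1', \bar\theta_2)\|}
= -\frac{(\theta_1 - \theta_1') + x_{0,1}(\theta_2 - \bar\theta_2)}{\|\theta - (\theta_1', \bar\theta_2)\|}
= -\frac{(\theta_1 - \theta_1')/(\theta_2 - \bar\theta_2) + x_{0,1}}{[(\theta_1 - \theta_1')^2/(\theta_2 - \bar\theta_2)^2 + 1]^{1/2}}
\]

and, similarly,

\[
\frac{(1, x_{0,2})(\theta - (\theta_1', \bar\theta_2))}{\|\theta - (\theta_1', \bar\theta_2)\|}
= \frac{(\theta_1 - \theta_1')/(\theta_2 - \bar\theta_2) + x_{0,2}}{[(\theta_1 - \theta_1')^2/(\theta_2 - \bar\theta_2)^2 + 1]^{1/2}}.
\]

For $|\theta_1 - \theta_1'|/|\theta_2 - \bar\theta_2| \ge 2\max\{|x_{0,1}|, |x_{0,2}|, 1\}$, one of these displays will be less than $-1/4$.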

□ 
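To see why the constant $-1/4$ works, here is a short check. The shorthand $r = (\theta_1 - \theta_1')/(\theta_2 - \bar\theta_2)$ and $M = \max\{|x_{0,1}|, |x_{0,2}|, 1\}$ is introduced only for this sketch, which takes $|r| \ge 2M$ and $\theta_2 > \bar\theta_2$ as in the proof.

```latex
\begin{align*}
&|r| \ge 2M \;\Longrightarrow\; |x_{0,k}| \le |r|/2 \ (k = 1, 2)
  \quad\text{and}\quad r^2 + 1 \le \tfrac{5}{4}\, r^2 . \\[4pt]
&\text{If } r > 0: \quad
  -\frac{r + x_{0,1}}{(r^2 + 1)^{1/2}}
  \le -\frac{r/2}{(\sqrt{5}/2)\, r}
  = -\frac{1}{\sqrt{5}} < -\frac{1}{4} . \\[4pt]
&\text{If } r < 0: \quad
  \frac{r + x_{0,2}}{(r^2 + 1)^{1/2}}
  \le \frac{r/2}{(\sqrt{5}/2)\, |r|}
  = -\frac{1}{\sqrt{5}} < -\frac{1}{4} .
\end{align*}
```

In the first case the display involving $x_{0,1}$ is the negative one, and in the second case it is the display involving $x_{0,2}$, so one of the two is always below $-1/4$.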

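proof of Theorem 5.3. For Assumption 4.3, note that

\[
\begin{aligned}
\bar m_\theta(\theta, x, P) &= \frac{d}{d\theta} E_P[\tau - I(W_i^H \le \theta_1 + X_i'\theta_{-1})|X_i = x] = -\frac{d}{d\theta} P(W_i^H \le \theta_1 + X_i'\theta_{-1}|X_i = x) \\
&= -f_{W_i^H|X_i}(\theta_1 + x'\theta_{-1}|x)(1, x').
\end{aligned}
\]

This is continuous as a function of $\theta$ uniformly in $(\theta, x, P)$ by Assumption 5.7 and the bound
on the support of $X_i$.

To verify the first part of Assumption 4.4, let $x_0(\theta, P)$, $t(\theta, P)$ and $\theta_0(\theta, P)$ be defined as
in the proof of Theorem 5.1, but with $E_P(W_i^H|X_i = x)$ replaced by $q_{\tau,P}(W_i^H|X_i = x)$. Then
$\theta_0(\theta, P) \in \Theta_0(P)$ and

\[
\bar m(\theta_0(\theta, P), x_0(\theta, P), P) = \tau - P(W_i^H \le \theta_1 + t(\theta, P) + X_i'\theta_{-1}|X_i = x_0(\theta, P)) = 0
\]

since $q_{\tau,P}(W_i^H|X_i = x_0(\theta, P)) = \theta_1 + t(\theta, P) + x_0(\theta, P)'\theta_{-1}$. We also have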

me%{9, P), xo{9, P), P){9 - 9,{9, P)) = -fw»\xS^i + x'9.,\x){l, x')i-t{9, P), 0, . . . , 0)' 
= fw»\xS^l + ^'9-l\x)t{9,P) = -fwH^^X9^+x'9^,\x)\\9-9o{9,P^ < -l\\9 - eo{9, P)\\. 
For the second part of Assumption 4.4, note that, since $\theta_0 = \theta_0(\theta, P) \in \Theta_0(P)$,

\[
\begin{aligned}
\bar m(\theta_0, x, P) &= \tau - P(W_i^H \le \theta_{0,1} + x'\theta_{0,-1}|X_i = x) \\
&= \tau - P(W_i^H \le q_{\tau,P}(W_i^H|X_i = x)|X_i = x) \\
&\quad + P(\theta_{0,1} + x'\theta_{0,-1} < W_i^H \le q_{\tau,P}(W_i^H|X_i = x)|X_i = x) \\
&= P(\theta_{0,1} + x'\theta_{0,-1} < W_i^H \le q_{\tau,P}(W_i^H|X_i = x)|X_i = x).
\end{aligned}
\]

For $\|x - x_0\|$ small enough, the distance between $\theta_{0,1} + x'\theta_{0,-1}$ and $q_{\tau,P}(W_i^H|X_i = x)$ will be
less than the $\eta$ in Assumption 5.7. For $x$ such that this holds,

\[
\begin{aligned}
|\bar m(\theta_0, x, P) - \bar m(\theta_0, x_0, P)| &= \bar m(\theta_0, x, P) \\
&= P(\theta_{0,1} + x'\theta_{0,-1} < W_i^H \le q_{\tau,P}(W_i^H|X_i = x)|X_i = x) \\
&\le \bar f\,[q_{\tau,P}(W_i^H|X_i = x) - \theta_{0,1} - x'\theta_{0,-1}] \\
&= \bar f\,\{[q_{\tau,P}(W_i^H|X_i = x) - \theta_{0,1} - x'\theta_{0,-1}] - [q_{\tau,P}(W_i^H|X_i = x_0) - \theta_{0,1} - x_0'\theta_{0,-1}]\}.
\end{aligned}
\]

Under Assumption 5.5, the second part of Assumption 4.4 then follows immediately since, for
$\alpha \le 1$ and $\|x - x_0\|$ small enough, $\|(x - x_0)'\theta_{0,-1}\| \le \|\theta_{0,-1}\| \|x - x_0\| \le \|\theta_{0,-1}\| \|x - x_0\|^{\alpha}$ so that
the expression in the above display is bounded by $\bar f(C + \|\theta_{0,-1}\|)\|x - x_0\|^{\alpha}$. Under Assumption
5.6, Assumption 4.4 follows from a second order mean value expansion of $q_{\tau,P}(W_i^H|X_i = x_0)$
since $x_0$ is on the interior of the support of $X_i$.

□ 
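The Hölder step under Assumption 5.5 combines two bounds. The sketch below assumes, as the proof suggests, that Assumption 5.5 gives $|q_{\tau,P}(W_i^H|X_i = x) - q_{\tau,P}(W_i^H|X_i = x_0)| \le C\|x - x_0\|^{\alpha}$ with $\alpha \le 1$, and it takes $\|x - x_0\| \le 1$.

```latex
\begin{align*}
&\bigl|[q_{\tau,P}(W_i^H|X_i = x) - x'\theta_{0,-1}] - [q_{\tau,P}(W_i^H|X_i = x_0) - x_0'\theta_{0,-1}]\bigr| \\
&\qquad \le \bigl|q_{\tau,P}(W_i^H|X_i = x) - q_{\tau,P}(W_i^H|X_i = x_0)\bigr| + \bigl|(x - x_0)'\theta_{0,-1}\bigr| \\
&\qquad \le C\,\|x - x_0\|^{\alpha} + \|\theta_{0,-1}\|\,\|x - x_0\| \\
&\qquad \le (C + \|\theta_{0,-1}\|)\,\|x - x_0\|^{\alpha} ,
\end{align*}
```

where the last line uses $\|x - x_0\| \le \|x - x_0\|^{\alpha}$, which holds when $\|x - x_0\| \le 1$ and $\alpha \le 1$. Multiplying by the upper bound on the conditional density gives the bound stated in the proof.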



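proof of Theorem 5.4. Everything is the same as in the proof of Theorem 5.3 except for
the verification of the first part of Assumption 4.4. Verifying this condition uses a similar
argument to the one in Theorem 5.2 for mean regression. For any $\theta$, either $(\theta_1', \theta_2) \in \Theta_0(P)$
for some $\theta_1'$, in which case the same argument to verify Assumption 4.4 goes through, or
$\theta_2 > \bar\theta_2$ or $\theta_2 < \underline\theta_2$, where $\bar\theta_2$ and $\underline\theta_2$ are defined as in the proof of Theorem 5.2 ($\bar\theta_2 =
\sup\{\theta_2 | (\theta_1, \theta_2) \in \Theta_0(P) \text{ some } \theta_1\}$ and $\underline\theta_2 = \inf\{\theta_2 | (\theta_1, \theta_2) \in \Theta_0(P) \text{ some } \theta_1\}$). If $\theta_2 > \bar\theta_2$ (a
symmetric argument applies when $\theta_2 < \underline\theta_2$), then, for some $\theta_1'$, $(\theta_1', \bar\theta_2) \in \Theta_0(P)$ and some
$x_{0,2} < x_{0,1}$, $q_{\tau,P}(W_i^H|X_i = x_{0,1}) = \theta_1' + x_{0,1}\bar\theta_2$ and $q_{\tau,P}(W_i^L|X_i = x_{0,2}) = \theta_1' + x_{0,2}\bar\theta_2$. We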
have me,i(0,xo,i,P) = -fwfixS^^ + 2;o,i02|a;o,i)(l, a;o,i) and me,2(0, a;o,2, P) = fw^^ixS^^ + 
a^o,2^2|a;o,i)(l,a;o,i), so 

m,,i(0,Xo,l,P)(0- (0;,02)) = -/vK«|x,(01 +4,1^2|xo,i)(l,Xo,l)(0- (0;,02)) 
and



Letting $a_1$ be the expression in the first display above, and $a_2$ the expression in the second
display above, note that, if

[fwl'\xX^i + a;o,i6'2|xo,i)]"^ ■ ai + [fwt\xX^^ + %,2^2|a;o,i)]"^ ■ 02 < -y 11^ " (^'i>^2)||, 

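then either $a_1 \le -\eta\|\theta - (\theta_1', \bar\theta_2)\|$ or $a_2 \le -\eta\|\theta - (\theta_1', \bar\theta_2)\|$. Thus, it suffices to bound the
expression on the left hand side of the above display divided by $\|\theta - (\theta_1', \bar\theta_2)\|$ away from zero
from above. The left hand side of the above display divided by $\|\theta - (\theta_1', \bar\theta_2)\|$ is equal to

\[
\frac{[(1, x_{0,2}) - (1, x_{0,1})](\theta - (\theta_1', \bar\theta_2))}{\|\theta - (\theta_1', \bar\theta_2)\|}
= \frac{-(x_{0,1} - x_{0,2})(\theta_2 - \bar\theta_2)}{[(\theta_1 - \theta_1')^2 + (\theta_2 - \bar\theta_2)^2]^{1/2}}.
\]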



By the same argument as in the proof of Theorem 5.2, this is bounded away from zero from
above for $|\theta_1 - \theta_1'|/|\theta_2 - \bar\theta_2|$ bounded away from infinity since $x_{0,1} - x_{0,2}$ is bounded away
from zero, and, for $|\theta_1 - \theta_1'|/|\theta_2 - \bar\theta_2|$ large enough, either $\bar m_{\theta,1}(\theta, x_{0,1}, P)(\theta - (\theta_1', \bar\theta_2))$ or
$\bar m_{\theta,2}(\theta, x_{0,2}, P)(\theta - (\theta_1', \bar\theta_2))$ will be less than the same negative constant for all $P \in \mathcal{P}$.

□ 

proof of Theorem 5.5. The result follows immediately from Theorem 3.1. □

proof of Theorem 5.6. For the case where Assumption 5.11 holds, the result follows by verifying
the conditions of Theorem 4.2 with $g$ a function that is positive only on $[\underline{x}, \overline{x}]$. For
the other cases, the result will follow by verifying the conditions of Theorem 4.3 once we
show that these models can be transformed so that Assumption 5.10 holds with $\phi_x$ in the
transformed model equal to zero and, under Assumption 5.10 on the original model, $\phi_m$ in
the transformed model equal to $\phi_m/(\phi_x + 1)$ and, under Assumption 5.9 (and $d_X = 1$) on
the original model, $\phi_m$ in the transformed model equal to $\phi_m/(\phi_x - 1)$. (Assumption 4.6
is invariant to taking the same invertible monotonic transformation of each element of $X_i$,
since we can replace $\|\cdot\|$ in that assumption with the supremum norm, and then the sets
involved are $d_X$-dimensional boxes, and the set of all $d_X$-dimensional boxes is invariant to
such transformations. This holds even for the transformations used under Assumption 5.9,
in which infinity is taken to a finite support point, by taking $t$ in Assumption 4.6 to be large
enough so that the largest value of any component of $X_i$ in the sample is contained in the
$d_X$-dimensional box.)

Suppose that Assumption 5.10 holds for some $\phi_m$ and $\phi_x$. Then, for any $t \in \mathbb{R}^{d_X}$ with each
element less than $\eta_X$

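\[
P(0 \le x_{0,k} - X_{i,k} \le t_k \text{ all } k) = P(x_0 - t \le X_i \le x_0)
\ge c \int_{x_{0,1} - t_1}^{x_{0,1}} \cdots \int_{x_{0,d_X} - t_{d_X}}^{x_{0,d_X}} \prod_{k=1}^{d_X} |x_{0,k} - x_k|^{\phi_x} \, dx_1 \cdots dx_{d_X}
= c \prod_{k=1}^{d_X} \frac{t_k^{\phi_x + 1}}{\phi_x + 1}
\]

so that

\[
P(x_{0,k} - t_k \le x_{0,k} - (x_{0,k} - X_{i,k})^{\phi_x + 1} \le x_{0,k} \text{ all } k) = P(0 \le x_{0,k} - X_{i,k} \le t_k^{1/(\phi_x + 1)} \text{ all } k)
\ge c \prod_{k=1}^{d_X} \frac{t_k}{\phi_x + 1}.
\]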



Thus, the random variable $V_i$ defined to have $k$th element $x_{0,k} - (x_{0,k} - X_{i,k})^{\phi_x + 1}$ for
$x_0 - \eta_X \le X_i \le x_0$ and $X_{i,k}$ otherwise will satisfy part (ii) of Assumption 5.10 (for a different
value of $\eta_X$) with $\phi_x$ equal to zero for the transformed variable. To get the conditional mean
of the transformed model, note that, for $x_0 - \eta_X \le X_i \le x_0$,

Ep{Wl'\V^ = v) = Ep{Wl'\xQ,k - (xo,fc - = Vk all k) 

= Ep{Wl'\x,,k - X,^k = {xo,k - all A;) = Ep(W^^|X,,, = xo,k - {xo,k - VkY^'^^-'''^ all k) 

< CWiixo,! - v,f'^^-+'\ . . . , (xo,,, - < CdfWx, - 

Thus, Assumption 5.10 will hold for the transformed model with $X_i$ replaced with $V_i$ and
$\phi_m$ in the transformed model equal to $\phi_m/(\phi_x + 1)$ and $\phi_x$ in the transformed model equal
to zero.

If Assumption 5.9 holds for some $\phi_m$ and $\phi_x$, then, for $t$ greater than $K_X$ (here $d_X = 1$),



P(x.>t)>-1 ,-*.^ = ^^t- 
Thus,



P{Kx + 1 - -Kx + l)>Kx + l-t) = P(-1/(X, -Kx + l)> -t) 

= P(1/(X, -Kx + l)<t)= P{l/t <X,~Kx + l) 

- ^(^x - 1 + 1/t < X.) > ^j^^iKx - 1 + > c^/^-' 

where the last inequality holds for $t$ small enough so that $1/t \ge K_X - 1$. It follows that
part (ii) of Assumption 5.10 holds with $\phi_x$ in that assumption replaced by $\phi_x - 2$ for the
transformed random variable $V_i$ given by $V_i = K_X + 1 - 1/(X_i - K_X + 1)$ for $X_i \ge K_X$ and
$V_i = X_i$ otherwise. Here, $x_0$ from Assumption 5.10 is equal to $K_X + 1$ in the transformed
model. As for the conditional mean of the transformed model, we have, for $v$ close enough
to $K_X + 1$,

\[
\begin{aligned}
E_P(W_i^H|V_i = v) &= E_P(W_i^H|K_X + 1 - 1/(X_i - K_X + 1) = v) \\
&= E_P(W_i^H|-1/(X_i - K_X + 1) = v - 1 - K_X) = E_P(W_i^H|X_i - K_X + 1 = -1/(v - 1 - K_X)) \\
&= E_P(W_i^H|X_i = -1/(v - 1 - K_X) + K_X - 1) \le C(-1/(v - 1 - K_X) + K_X - 1)^{-\phi_m} \\
&\le 2C(1/(1 + K_X - v))^{-\phi_m} = 2C(1 + K_X - v)^{\phi_m}
\end{aligned}
\]

so that part (i) of Assumption 5.10 holds with the same $\phi_m$.

□ 
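The effect of the power transformation on the two exponents can be checked in one dimension. This sketch assumes $d_X = 1$, writes $V_i = x_0 - (x_0 - X_i)^{\phi_x + 1}$, and takes the density and conditional mean bounds in the polynomial form used in the proof; the constants $c$ and $C$ stand for the bounds from Assumption 5.10.

```latex
\begin{align*}
\int_{x_0 - s}^{x_0} (x_0 - x)^{\phi_x}\,dx &= \frac{s^{\phi_x + 1}}{\phi_x + 1}, \\
P(0 \le x_0 - V_i \le t)
  &= P\bigl(0 \le x_0 - X_i \le t^{1/(\phi_x + 1)}\bigr)
   \ge c\,\frac{\bigl(t^{1/(\phi_x + 1)}\bigr)^{\phi_x + 1}}{\phi_x + 1}
   = \frac{c\,t}{\phi_x + 1}, \\
C\,(x_0 - x)^{\phi_m}\Big|_{x_0 - x = (x_0 - v)^{1/(\phi_x + 1)}}
  &= C\,(x_0 - v)^{\phi_m/(\phi_x + 1)} .
\end{align*}
```

The probability bound is linear in $t$, which corresponds to a density exponent of zero for the transformed variable, and the conditional mean exponent becomes $\phi_m/(\phi_x + 1)$. In the unbounded support case the density exponent is $\phi_x - 2$, so the same calculation gives $\phi_m/((\phi_x - 2) + 1) = \phi_m/(\phi_x - 1)$.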

proof of Theorem 6.1. Let $\theta_n$ be a sequence converging to $\theta_0$ such that, for some $\varepsilon > 0$,
$d_H(\theta_n, \Theta_0(P)) = n^{-\alpha/(2d_X + 2\alpha)}\varepsilon$ for large enough $n$ (conditions on how small $\varepsilon$ is will be
stated below). Such a sequence exists by part (iv) of Assumption 6.1. For each $n$, let
$\theta_0'(n) \in \delta\Theta_0(P)$ be such that $d_H(\theta_n, \theta_0'(n)) \le 2n^{-\alpha/(2d_X + 2\alpha)}\varepsilon$ (doubling the distance to the
identified set covers the possibility that the infimum is not achieved). For each $j$, we have,
for some $x_0 \in \mathcal{X}_0(\theta_0'(n))$ and some $\theta_n^*$ between $\theta_n$ and $\theta_0'(n)$,

rhj(9n,x,P) = rhj{9n,x,P) - rhj{9Q{n),xo, P) 

= [rhj{9n,x,P) - mj{9Q{n),x,P)] + [mj{9Q{n),x, P) - rhj {9^(71), xo, P)] 
= mej{9*^,x,P){9n - 6'o(n)) + [mj{9Q{n), x, P) - mj(6'o(n), Xq, P)] 

xo&Xoie'oin)) 

where is a bound on the derivative. For n large enough, the last line of the above display 
This will imply, letting $\bar g$ be an upper bound for functions in $\mathcal{G}$ and $K_1$ an upper bound for
the number of elements in $\mathcal{X}_0(\theta_0'(n))$,



^ -l/{2dx+2a) 



^-dx/{2dx+2a) 



Here, the first inequality follows for large enough n since fhj{6n, x, P) > —2Kn^°'^^'^'^^^'^°'^e 
eventually by the argument above. 

If dHiCnAcn),QoiP)) < En-^/^^^^+^''\ then 9^ i C„,,(c„), so that T„,^(^„) > CnU-^/^ > 
cn^^/^ where c is a lower bound for c„. Then, for some j and (7, we will have, letting 
Ks,i be as in Assumption 13. 4[ uj{9n, g)jln,jidn, g) < —Ks^icn'^^"^ so that, letting cJ be an 
upper bound for u{9,g), n^^'^finjiOn, g) < —Ks^ic/u. For large enough n, we will also have 
n^/'^fiPj{9n,g) > -2KeK{gf2'^^ (^) . This will imply 



n 



1/2 



{KA^n.g) - [/ip,,(^n,^?) A 0]} < ~Ks,ic/W+2KeK{gf2''^ 



2Ke\ 
V J 



dx/a 



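so that $n^{1/2}\{\hat\mu_{n,j}(\theta_n, g) - [\mu_{P,j}(\theta_n, g) \wedge 0]\}$ is bounded away from zero from above by a negative
constant when this event holds for small enough $\varepsilon$. Thus, it suffices to show that, for
any $\delta > 0$, $n^{1/2} \inf_{g \in \mathcal{G}} \{\hat\mu_{n,j}(\theta_n, g) - [\mu_{P,j}(\theta_n, g) \wedge 0]\} \ge -\delta$ with probability approaching one.
We have, for any $r > 0$,

\[
\begin{aligned}
& n^{1/2} \inf_{g \in \mathcal{G}} \{\hat\mu_{n,j}(\theta_n, g) - [\mu_{P,j}(\theta_n, g) \wedge 0]\} \\
&\quad \ge n^{1/2} \inf_{g \in \mathcal{G}} \hat\mu_{n,j}(\theta_n, g) I(\mu_{P,j}(\theta_n, g) \ge r) + n^{1/2} \inf_{g \in \mathcal{G}} \{\hat\mu_{n,j}(\theta_n, g) - \mu_{P,j}(\theta_n, g)\} I(\mu_{P,j}(\theta_n, g) < r).
\end{aligned}
\]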

The first term is greater than zero with probability approaching one since $\hat\mu_{n,j}(\theta, g)$ converges
to $\mu_{P,j}(\theta, g)$ at a root-$n$ rate uniformly over $(\theta, g)$ by standard arguments (e.g. Theorem 2.5.2
in van der Vaart and Wellner (1996)).
As for the second term, note that, for any $\delta_1, \delta_2 > 0$ with $\delta_1^{\alpha} < \eta$, $\bar m_j(\theta_n, x, P)$ will be
greater than or equal to $-\delta_2 I(d_H(x, \mathcal{X}_0(\theta_0'(n))) < \delta_1) + (\eta\delta_1^{\alpha} - \delta_2) I(d_H(x, \mathcal{X}_0(\theta_0'(n))) \ge \delta_1)$ for
large enough $n$ by (8). To simplify notation, define the sets $A_{n,\delta_1} = \{x | d_H(x, \mathcal{X}_0(\theta_0'(n))) <
\delta_1\}$. Using this notation, the above observation implies that, for $n$ greater than some constant
that depends on $\delta_1$,

/ip,,(^„, (?) = Epm,{en,X,,P)g{X,) > -82Epg{X,)I{X, G A^^sJ + (v^? - 82)Epg{X,)I{X, i A^^, 

If fipj{6n,g) < T, then this means that 

(r/5r - 82)Epg{X,)I{Xi i An,5,) < 82Epg{X,)I{X, G An,5,) + r 

where, as above, Ki is an upper bound for the number of elements in A'o(6'q(?7.)). Thus, for 
fJ^pji^nyg) < and n larger than some constant that depends only on 81, letting ^ be a 
bound for g{Xi) and M a bound for rrijiWi, 9), 

Ep[m,{W„en)g{X,)]' < gM'Epg{X,) = gM'[Epg{X,)I{X, i A„,,J + Epg{X,)l{X, G A^^,,)\ 

< gMH[82Epg{Xi)I{X, G A^^sJ + r]/ W - '^2) + ^p^?(X,)/(X, G A^^s,)} 
82 A „ . , . r 



gM' 



<gM' 



V8? - 82 
82 

V8f - 82 



+ ?j Epg{X,)I{X,eAn,5,) + ^ 



-80 



+ l\gK^{28,Y^ + 



r/5f - 82 



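By choosing $r$, $\delta_1$, and $\delta_2$ so that $\delta_1$, $r/(\eta\delta_1^{\alpha} - \delta_2)$ and $\delta_2/(\eta\delta_1^{\alpha} - \delta_2)$ are small, we can make
the last line of the display less than any $\delta_3 > 0$. Then, for $n$ large enough, $\mu_{P,j}(\theta_n, g) < r$
will imply $\mathrm{var}_P[m(W_i, \theta_n) g(X_i)] \le \delta_3$, so that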



n^/^ inf {p,n,jiOn, g) - fipjiOn, g)} I{iJ,pj{6n, g) < 
gee 



r . 



> inf {fLnj{9n,g) - fxpj{9n, g)} Iivarp[m{Wi,9n)g{Xi)] < 8^). 

This can be made arbitrarily small in magnitude by the stochastic asymptotic equicontinuity
of $n^{1/2}(E_n - E_P) m(W_i, \theta) g(X_i)$ with respect to the covariance semimetric $\rho((\theta, g), (\theta', g')) =
\mathrm{var}_P[m(W_i, \theta) g(X_i) - m(W_i, \theta') g'(X_i)]$ as a sequence of processes indexed by $(\theta, g)$. Letting
$\tilde g(x) = 0$ be the zero function and $\tilde\theta$ an arbitrary value in $\Theta$, the last line of the above display
is equal to



inf (E„ - E)mj{W„9n)gj{X,) - - ^)m,(W„ ^~)^,(X,) I{p{{9n,g), {9,~g)) < Ss). 

By making $\delta_3$ small, the probability of this being less than any negative constant can be
made arbitrarily small by equicontinuity of $n^{1/2}(E_n - E) m_j(W_i, \theta_n) g_j(X_i)$ in $\rho$.

□ 

proof of Theorem 6.2. By the same argument that gives (7) in the proof of Theorem 4.3,
we will have, for $\theta$ with $d_H(\theta, \Theta_0(P))$ smaller than some constant that does not depend on
$P$, there exists a $\theta_0 \in \Theta_0(P)$, $j_0$ and $g \in \mathcal{G}$ with $g(x) \ge C_{g,1} I(\|x - x_0\| \le [\eta/(4C)]^{1/\alpha}\|\theta -
\theta_0\|^{1/\alpha} C_{g,2})$ such that

\[
\mu_{P,j_0}(\theta, g) = E_P \bar m_{j_0}(\theta, X_i, P) g(X_i) \le -(\eta/4)\|\theta - \theta_0\| E_P g(X_i).
\]

This, and the lower bound on $g$, gives



Thus, the conditions of Lemma A.5 hold with $\gamma = \alpha/(d_X + \alpha)$.



□ 
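The two values of $\gamma$ can be compared directly. The following is a sketch of the rate arithmetic only: it takes the rate exponent to be $\gamma/2$ in both cases, with $\gamma$ from the proof of Theorem 4.3 for the truncated variance weighting and from this proof for bounded weights, and it ignores the critical value and $\log n$ factors.

```latex
\begin{align*}
\text{truncated variance weighting:}\quad
  \frac{\gamma}{2} &= \frac{1}{2}\cdot\frac{2\alpha}{d_X + 2\alpha} = \frac{\alpha}{d_X + 2\alpha}, \\
\text{bounded weights:}\quad
  \frac{\gamma}{2} &= \frac{1}{2}\cdot\frac{\alpha}{d_X + \alpha} = \frac{\alpha}{2d_X + 2\alpha}, \\
\frac{\alpha}{d_X + 2\alpha} &> \frac{\alpha}{2d_X + 2\alpha}
  \quad\text{for all } d_X \ge 1,\ \alpha > 0 .
\end{align*}
```

With $d_X = 1$ and $\alpha = 2$, for example, the exponents are $2/5$ and $1/3$, so the variance weighted region shrinks toward the identified set at the faster polynomial rate.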



The proof of Theorem 6.2 uses the following lemma, which is analogous to Theorem 4.2
for set estimates based on variance weighted KS statistics.

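Lemma A.5. Suppose that, for some positive constants $C$, $\gamma$, and $\delta$, we have, for all $P \in \mathcal{P}$
and $\theta$ with $d_H(\theta, \Theta_0(P)) < \delta$,

\[
\inf \mu_{P,j}(\theta, g) \le -C d_H(\theta, \Theta_0(P))^{1/\gamma}
\]

where the infimum is taken over $g \in \mathcal{G}$ and $j \in \{1, \ldots, d_Y\}$. Suppose that Assumptions 3.1,
3.2, 3.3, 3.4 and 4.1 hold, and that the weight function $\omega_n(\theta, g)$ satisfies $\underline\omega \le \omega_n(\theta, g) \le \bar\omega$
for some $0 < \underline\omega \le \bar\omega < \infty$, and suppose that $c_n \to \infty$ with $c_n/\sqrt{n} \to 0$. Then,

\[
\inf_{P \in \mathcal{P}} P(\Theta_0(P) \subseteq C_{n,\omega}(c_n)) \to 1
\]

and, for some large $B$,

\[
\sup_{P \in \mathcal{P}} P\left((n/c_n^2)^{\gamma/2} d_H(C_{n,\omega}(c_n), \Theta_0(P)) > B\right) \to 0.
\]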



Proof. First, note that, for all $j$, $\sup_{\theta, g} \sqrt{n}\,|(E_n - E) m_j(W_i, \theta) g_j(X_i)| = O_P(1)$ uniformly
in $P$ by Theorem 2.14.1 in van der Vaart and Wellner (1996) (the constant function equal
to $\bar Y$ does not depend on $P$ and can be used as an envelope function). This, along with
Assumption 3.4 and the bound on the weight function, implies the first claim.

For the second claim, once $\Theta_0(P) \subseteq C_{n,\omega}(c_n)$, if $(n/c_n^2)^{\gamma/2} d_H(C_{n,\omega}(c_n), \Theta_0(P)) > B$, there
will be a $\theta \in C_{n,\omega}(c_n)$ such that $d_H(\theta, \Theta_0(P)) \ge B (c_n^2/n)^{\gamma/2}$. If $d_H(C_{n,\omega}(c_n), \Theta_0(P)) < \delta$, which
happens with probability approaching one uniformly in $P \in \mathcal{P}$ by arguments similar to
the proof of Theorem 4.1, then, for this $\theta$ and $P$, there will be a $g^*$ and $j^*$ such that, for $n$
greater than some constant that does not depend on $P$, $\mu_{P,j^*}(\theta, g^*) \le -(C/2)(c_n^2/n)^{1/2} B^{1/\gamma}$.
Since $\theta \in C_{n,\omega}(c_n)$, we will also have $T_{n,\omega}(\theta) \le c_n n^{-1/2}$, so that $\hat\mu_{n,j^*}(\theta, g^*)\omega_{n,j^*}(\theta, g^*) \ge
-c_n n^{-1/2} K_{S,2}$. By the lower bound on the weight function, this implies $\hat\mu_{n,j^*}(\theta, g^*) \ge
-c_n n^{-1/2} K_{S,2}/\underline\omega$. Thus,

\[
\sqrt{n}\,[\hat\mu_{n,j^*}(\theta, g^*) - \mu_{P,j^*}(\theta, g^*)] \ge c_n\left[-K_{S,2}/\underline\omega + (C/2)B^{1/\gamma}\right].
\]

For $B$ large enough, the right hand side will go to infinity. Since the left hand side is $O_P(1)$
uniformly in $P \in \mathcal{P}$, this gives the desired result.

□ 

proof of Theorem 6.3. Let $\theta_n$ and $\theta_0'(n)$ be as in the proof of Theorem 6.1, but with $d_H(\theta_n, \Theta_0(P))$



\/log n 



If dHiCl-^icn), eo(P)) < e (^-^ V , then 9^ i 



l^^^iK) so that T„k:i-,„(e„) > c„ 



Then, letting Ks^\ be as in Assumption 13. 4[ we will have, for some j and x, rhiix, 9r,) < 



—Ks,iCn- By Lemmas IA.6I and IA.71 for large enough a we will have, for some constant K 



sup . 



{En - Ep)m,{W,,9)k{{X, - x)/K) 



< K 



(9) 



Enk{{X, - x)/K) 

with probability approaching one (Lemma A.6 allows $E_P k((X_i - x)/h_n)$ to be replaced by
its sample analogue in Lemma lA.7p . When T^'^f'^ (6'„) > c„, we will have 



- Ep)m,{W,, en)k{{X, - x)/K) ^ Epmj{W„ en)k{{Xi - x)/K) 



Enk{{X, - x)IK) Enk{{X, - x)/hn) 



dx ^ 

mj{x,9n) < -Ks,ic„ 



so that, when (P) holds, we will have 

Epm,{W„en)k{{X,-x)/K) 



< -KsA&n + K. 



\/\ogn Enk{{Xi - x) / K) 
Appealing again to Lemma A.6, if $a$ is large enough, this will imply



Epmj{W,,en)k{{Xi - x)/K) ^ -Ks,iCn + K 
v45i^ Epk{{Xi-x)/K) - 2 

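Letting $\eta$ be as in Assumption 4.5, letting $\varepsilon_1 > 0$ and $\varepsilon_2 > 0$ be such that $k(t) \ge \varepsilon_1$ for
$\|t\| \le \varepsilon_2$ and defining $K_1 = \eta\varepsilon_1\varepsilon_2^{d_X}$, we have $E_P k((X_i - x)/h_n) \ge \varepsilon_1 P(\|X_i - x\| \le h_n\varepsilon_2) \ge
\eta\varepsilon_1\varepsilon_2^{d_X} h_n^{d_X} = K_1 h_n^{d_X}$ by Assumption 4.5, so that the above display implies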



Epmj[Wi,9n)k[[Xi - x)lhn) < Kih^^ — 



n 



2 Vuhi"" 2 

Let $c_n$ be large enough so that $K_1(-K_{S,1} c_n + K)/2 < -\delta$ for some fixed constant $\delta > 0$.
Then the above display implies



Epm,{W,,9n)k{iX,-x)/hn) < (10) 

When this holds, the right hand side will be negative, so that, by Lemma IA.8t hn < 
B[dH{On, eo(P))]^/". If h'^ > , this will imply K < e^^'^Bhn, which is a contradiction 

V nh„^ 

for e small enough. 



Now suppose /i" < ■ By the same argument as in the proof of Theorem 16.11 we 

V nh„^ 

have, for some constant K2 that does not depend on n, rhj{9n,x) > —K2dH{0n,Qo{P)) so 
that, if h'^ < , mj(^„,x) > -5X2 so that the left hand side of is greater 
than or equal to



-eK2^^^^Epk{{X, - x)/K) > -eK2 



og n 



so that ffTOj) imphes efK2 > 5, a contradiction for e small enough. 



□ 



The proof of Theorem 6.3 uses the lemmas stated and proved below.

Lemma A.6. Suppose that Assumption 6.2 holds, and that Assumption 4.5 and part (iii) of
Assumption 6.1 hold, with the upper bound on the density in the latter assumption uniform
in $P \in \mathcal{P}$. Then, for any $\varepsilon$, there exists an $a$ such that, if $h_n^{d_X} n/\log n \ge a$ eventually,



sup P I sup 

PeV \xesuppp{Xi) 



Enk{{X,-x)/h) 



Epk{{X, - x)/h) 



> 6 



rt— >oo „ 





for all 6 > 0. 
Proof. We have 



E„A;((X,-x)//i„) 



Epk{{Xi-x)/hr,) 



{EpMx,-x)/h^)]'y/' 



Epk{{Xi-x)/hn) 



{E^-Ep)k{{X,-x)/h^) 



{Ep[k{{x,-x)/h^)Yy/^ 



< k 



1/2 



{En - Ep)k{{Xi - x)/K 



{Ep[k{{x,~x)ihn)YY/^ 



[Epk{{X,-x)lhn)Yl^ 



where $\bar k$ is an upper bound for the kernel function $k$. By Theorem A.1,



n 



sup P I sup 

p&v \ xesuppp(x,) Vlogn 



{En-Ep)k{{X,-x)/hr. 



{Ep[k{{X, - x)//l„)]2}l/2 



for large enough $K$ (the lower bound on the denominator follows from Assumption 4.5),
so the result will follow if we can show that $[E_P k((X_i - x)/h_n)]^{1/2}\sqrt{n}/\sqrt{\log n}$ can be made
arbitrarily large by choosing $a$ large in the assumptions of the lemma. By Assumptions 6.2
and 4.5, we have, for some $\delta > 0$ and all $x$ on the support of $X_i$ under $P$,

\[
[n/(\log n)]\, E_P k((X_i - x)/h_n) \ge [n/(\log n)]\,\delta h_n^{d_X},
\]

and taking the square root of this expression gives something that can be made arbitrarily
large by choosing $a$ large. □
Lemma A.7. Suppose that Assumption 6.2 holds, and that Assumption 4.5 and part (iii) of
Assumption 6.1 hold, with the upper bound on the density in the latter assumption uniform
in $P \in \mathcal{P}$. Then, if $h_n^{d_X} n/\log n \ge a$ eventually for $a$ large enough, we will have



sup P sup 



{En - Ep)mj{W,, 9)k{{Xi - x)/hn) 



Epk{{Xi - X)lhr,) 



> 5 1 



for some B. 
Proof. We have 



\fnh 



dx 



{En - Ep)m,{W,, e)k{{Xi - x)/h„ 



n 



\/\ogn 



Epk{{X, - x)/hn) 
{En - Ep)m,{Wi, e)k{{X, - x)/hn) 



y/varp[mj{Wi, e)k{{Xi - x)/hn)] V 
[^Jvarp[mJ{W,,e)k{{X,-x)/hn)] V 
Epk{{X, - x)/hn) 



Since 



varp[m,{W,, e)k{{X, - x)/hn)] < YEp[k{{X, - x)lhn)f < Yf / [k{{t - x)/hnW dt 



[k{uyf' du, 



the last line is bounded by a constant times



{En - Ep)mj{W^, e)k{{Xi - x)/h. 



n 



Vlogn 



^varp[m,{W,, e)k{{X, - x)/hn)] V \fW 



Epk{{Xi-x)lhn)' 



By Assumptions 6.2 and 4.5, we have, for some $\delta > 0$ and $x$ on the support of $X_i$ under $P$,
$E_P k((X_i - x)/h_n) \ge \delta h_n^{d_X}$, so that this is bounded by



n 



A/log n 



{En - Ep)mj{W,, e)k{{X, - x)/hn) 



^varp[mj{Wu9)k{{Xi - x)/hn)] V \fW 



■ (1/5)- 



The claim now follows from Theorem A.1, with $\sqrt{h_n^{d_X}}$ playing the role of the cutoff point
$\sigma_n$.

□



Lemma A. 8. Suppose that Assumptions^^ \ 6.1\ and \6.S\ hold. Let 9q be as in Assumption 
\6.1\ and let 9^ he a sequence in Q\Qq{P) converging to Oq. Then, for some constant B that 
does not depend on $n$ and some $N \in \mathbb{N}$, $E_P m_j(W_i, \theta_n) k((X_i - x)/h_n)$ will be nonnegative for
$h_n \ge B[d_H(\theta_n, \Theta_0(P))]^{1/\alpha}$ and $n \ge N$ for $x$ on the support of $X_i$.

Proof. Let $b_n = d_H(\theta_n, \Theta_0(P))$. By an argument similar to the one leading up to Equation
(8), we will have, for each $j$,

\[
\bar m_j(\theta_n, x, P) \ge -Cb_n + \eta \min_{x_0 \in \mathcal{X}_0(\theta_0'(n))} (\|x - x_0\|^{\alpha} \wedge \eta)
\]

for some $C$ that depends only on the bound on the derivative $\bar m_{\theta,j}(\theta, x, P)$ in Assumption
6.1 and some $\theta_0'(n) \in \Theta_0(P)$. Thus, for $x$ such that $\bar m_j(\theta_n, x, P) \le Cb_n$, we will have, for
some $x_0 \in \mathcal{X}_0(\theta_0'(n))$, $Cb_n \ge -Cb_n + \eta(\|x - x_0\|^{\alpha} \wedge \eta)$ so that $2Cb_n \ge \eta(\|x - x_0\|^{\alpha} \wedge \eta)$. For
$b_n$ small enough, this implies that $\|x - x_0\| \le (2Cb_n/\eta)^{1/\alpha}$. This means that, letting $K$ be a
bound for the number of elements in $\mathcal{X}_0(\theta_0'(n))$ and $\bar f$ an upper bound for the density of $X_i$,

\[
P(\bar m_j(\theta_n, X_i, P) \le Cb_n) \le K\bar f (2Cb_n/\eta)^{d_X/\alpha}. \tag{11}
\]

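This, and the lower bound on $\bar m_j(\theta_n, x, P)$ imply, letting $\bar k$ be an upper bound on the kernel
$k$,

\[
\begin{aligned}
E_P m_j(W_i, \theta_n) k((X_i - x)/h_n) I(\bar m_j(\theta_n, X_i, P) \le Cb_n) &\ge -\bar k Cb_n P(\bar m_j(\theta_n, X_i, P) \le Cb_n) \\
&\ge -\bar k Cb_n \cdot K\bar f (2Cb_n/\eta)^{d_X/\alpha}.
\end{aligned}
\]

We also have, for $x$ on the support of $X_i$, letting $\varepsilon$ and $K_1$ be such that $k(t) \ge K_1$ for
$\|t\| \le \varepsilon$,

\[
\begin{aligned}
& E_P m_j(W_i, \theta_n) k((X_i - x)/h_n) I(\bar m_j(\theta_n, X_i, P) > Cb_n) \\
&\quad \ge Cb_n E_P k((X_i - x)/h_n) I(\bar m_j(\theta_n, X_i, P) > Cb_n) \\
&\quad \ge K_1 Cb_n E_P I(\|(X_i - x)/h_n\| \le \varepsilon) I(\bar m_j(\theta_n, X_i, P) > Cb_n) \\
&\quad \ge K_1 Cb_n [P(\|(X_i - x)/h_n\| \le \varepsilon) - P(\bar m_j(\theta_n, X_i, P) \le Cb_n)] \\
&\quad \ge K_1 Cb_n [\eta\varepsilon^{d_X} h_n^{d_X} - K\bar f (2Cb_n/\eta)^{d_X/\alpha}].
\end{aligned}
\]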

The last inequality follows from Assumption 4.5 and from the inequality (11) above (here the
two $\eta$s come from different conditions, but they can be chosen to be the same by decreasing
one). Combining this with the bound in the previous display gives

\[
\begin{aligned}
E_P m_j(W_i, \theta_n) k((X_i - x)/h_n) &\ge K_1 Cb_n [\eta\varepsilon^{d_X} h_n^{d_X} - K\bar f (2Cb_n/\eta)^{d_X/\alpha}] - \bar k Cb_n \cdot K\bar f (2Cb_n/\eta)^{d_X/\alpha} \\
&= K_2 b_n h_n^{d_X} - K_3 b_n^{(d_X + \alpha)/\alpha}
\end{aligned}
\]

where $K_2 = K_1 C\eta\varepsilon^{d_X}$ and $K_3 = K_1 CK\bar f (2C/\eta)^{d_X/\alpha} + \bar k CK\bar f (2C/\eta)^{d_X/\alpha}$ are both positive
constants that do not depend on $n$. For $h_n \ge [K_3/K_2]^{1/d_X} b_n^{1/\alpha}$, this will be nonnegative.

□ 
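The threshold on $h_n$ in the last step follows from a one-line rearrangement. In this sketch, $K_2$ and $K_3$ denote the two positive constants defined at the end of the proof ($K_3$ is an assumed label for the second constant), and the lower bound is assumed to have the form $K_2 b_n h_n^{d_X} - K_3 b_n^{1 + d_X/\alpha}$.

```latex
\begin{align*}
K_2\, b_n h_n^{d_X} - K_3\, b_n^{1 + d_X/\alpha} \ge 0
  &\iff h_n^{d_X} \ge \frac{K_3}{K_2}\, b_n^{d_X/\alpha} \\
  &\iff h_n \ge \left(\frac{K_3}{K_2}\right)^{1/d_X} b_n^{1/\alpha} .
\end{align*}
```

Under these assumptions the constant $B$ in the statement of the lemma can be taken to be $(K_3/K_2)^{1/d_X}$, with $b_n = d_H(\theta_n, \Theta_0(P))$.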